Numerical Methods for Ordinary Differential 
Equations 


C. Vuik 
FJ. Vermolen 
M.B. van Gijzen 
M.J. Vuik 


©VSSD 
First edition 2007 
Second edition 2015 


© 2023 TUDelft Open 

ISBN 978-94-6366-665-7 (Ebook) 
ISBN 978-94-6366-664-0 (Paperback) 
Doi: https://doi.org/10.5074/t.2023.001 


This work is licensed under a 
Creative Commons Attribution 4.0 
International license 


TU Detft OPEN 


Keywords: numerical analysis, ordinary differential equations 


iti 


Preface 


In this book we discuss several numerical methods for solving ordinary differential equations. 
We emphasize the aspects that play an important role in practical problems. We confine ourselves 
to ordinary differential equations with the exception of the last chapter in which we discuss the 
heat equation, a parabolic partial differential equation. The techniques discussed in the intro- 
ductory chapters, for instance interpolation, numerical quadrature and the solution to nonlinear 
equations, may also be used outside the context of differential equations. They have been in- 
cluded to make the book self-contained as far as the numerical aspects are concerned. Chapters, 
sections and exercises marked with a * are not part of the Delft Institutional Package. 


The numerical examples in this book were implemented in Matlab, but also Python or any other 
programming language could be used. A list of references to background knowledge and related 
literature can be found at the end of this book. Extra information about this course can be found 
at http: //NMODE.ewi.tudelft .nl, among which old exams, answers to the exercises, and a link 
to an online education platform. We thank Matthias MGller for his thorough reading of the draft 
of this book and his helpful suggestions. 


Delft, June 2016 
C. Vuik 


The figure at the cover shows the Erasmus bridge in Rotterdam. Shortly after the bridge became 
operational, severe instabilities occurred due to wind and rain effects. In this book we study, 
among other things, numerical instabilities and we will mention bridges in the corresponding 
examples. Furthermore, numerical analysis can be seen as a bridge between differential equa- 
tions and simulations on a computer. 
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Chapter 1 


Introduction 


1.1 Some historical remarks 


Modern applied mathematics started in the 17th and 18th century with scholars like Stevin, 
Descartes, Newton and Euler. Numerical aspects found a natural place in the analysis but the 
expression “numerical mathematics” did not exist at that time. However, numerical methods 
invented by Newton, Euler and at a later stage by Gauss still play an important role even today. 


In the 17th and the 18th century fundamental laws were formulated for various subdomains of 
physics, like mechanics and hydrodynamics. These laws took the form of simple looking math- 
ematical equations. To the disappointment of many scientists, these equations could be solved 
analytically in a few special cases only. For that reason technological development has only been 
loosely connected with mathematics. The introduction and availability of digital computers has 
changed this. Using a computer it is possible to gain quantitative information with detailed and 
realistic mathematical models and numerical methods for a multitude of phenomena and pro- 
cesses in physics and technology. Application of computers and numerical methods has become 
ubiquitous. Statistical analysis shows that non-trivial mathematical models and methods are 
used in 70% of the papers appearing in the professional journals of engineering sciences. 


Computations are often cheaper than experiments; experiments can be expensive, dangerous or 
downright impossible. Real life experiments can often be performed on a small scale only, which 
makes their results less reliable. 


1.2. What is numerical mathematics? 


Numerical mathematics is a collection of methods to approximate solutions to mathematical 
equations numerically by means of finite computational processes. 


In large parts of mathematics the most important concepts are mappings and sets. In numerical 
mathematics the concept of computability should be added. Computability means that the result 
can be obtained in a finite number of operations (so the computation time will be finite) on a 
finite subset of the rational numbers (because a computer has only finite memory). 


In general the result will be an approximation of the solution to the mathematical problem, since 
most mathematical equations contain operators based on infinite processes, like integrals and 
derivatives. Moreover, solutions are functions whose domain and image may (and usually do) 
contain irrational numbers. 


Because, in general, numerical methods can only obtain approximate solutions, it makes sense 
to apply them only to problems that are insensitive to small perturbations, in other words to 
problems that are stable. The concept of stability belongs to both numerical and classical math- 
ematics. An important instrument in studying stability is functional analysis. This discipline 
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also plays an important role in error analysis (investigating the difference between the numerical 
approximation and the solution). 


Calculating with only a finite subset of the rational numbers has many consequences. For exam- 
ple: a computer cannot distinguish between two polynomials of sufficiently high degree. Conse- 
quently, methods based on the main theorem of algebra (i.e. that an nth degree polynomial has 
exactly n complex zeros) cannot be trusted. Errors that follow from the use of finitely many digits 
are called rounding errors (Section 1.4). 


An important aspect of numerical mathematics is the emphasis on efficiency. Contrary to or- 
dinary mathematics, numerical mathematics considers an increase in efficiency, i.e. a decrease 
of the number of operations and/or amount of storage required, as an essential improvement. 
Progress in this aspect is of great practical importance and the end of this development has not 
been reached yet. Here, the creative mind will meet many challenges. On top of that, revolutions 
in computer architecture will overturn much conventional wisdom. 


1.3. Why numerical mathematics? 


A big advantage of numerical mathematics is that it can provide answers to problems that do not 
admit closed-form solutions. Consider for example the integral 


7U 
‘f 1+ cos? xdx. 
0 


This is an expression for the arc length of one arc of the curve y(x) = sin x, which does not have 
a solution in closed form. A numerical method, however, can approximate this integral in a very 
simple way (Chapter 5). An additional advantage is that a numerical method only uses stan- 
dard function evaluations and the operations addition, subtraction, multiplication and division. 
Because these are exactly the operations a computer can perform, numerical mathematics and 
computers form a perfect combination. 


An advantage of analytical methods is that the solution is given by a mathematical formula. 
From this, insight in the behavior and the properties of the solution can be gained. For numerical 
approximations, however, this is not the case. In that case, visualization tools may be used to gain 
insight in the behavior of the solution. Using a numerical method to draw a graph of a function 
is usually a more useful tool than evaluating the solution at a large number of points. 


1.4 Rounding errors 


A computer uses a finite representation of the all numbers in R. These are stored in a computer 
in the form 


L0.dido...dn- BY, (1.1) 


in which, by definition, dj > 0 and 0 < d; < 6. The normalization is needed in order to prevent a 
waste of digits and to make the representation unambiguous. We call the value in equation (1.1) 
a floating point number (representation) in which 0.d,d> ...dy is called the mantissa, B the base and 
e (integer) the exponent, where L < e < U. Characteristic values for |L| and U are in the range 
[100, 1000], often, B = 2 (binary representation) and n = 24 (single precision) or n = 53 (double 
precision). Most computers and software packages (Matlab) satisfy the IEEE-754 standard, and 
hence provide single-! and double-precision? computations. 


Letforx€ R 
0.d1...dyn- B° <x < 0.dydo...(dn +1)- B’, 


lnttp://en.wikipedia.org/wiki/Single-precision_floating-point_format 
*http: //en.wikipedia. org/wiki/Double-precision_floating-point_format 
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where for simplicity x is taken positive, and we assume that d, < B — 1. Rounding x means that 
x will be replaced with the floating point number closest to x, which is called f1(x). The error 
caused by this process is called rounding error, and f1(x) is written as 


fl(x) =x(1+.e). (1.2) 


The value |x — fl(x)| = |ex| is called the absolute error and |x — fl(x)|/|x| = |e| the relative 
error. The difference between the floating point numbers enclosing x is 6°". Rounding yields 
|x — fl(x)| < B°-"/2, hence the absolute error is bounded by 


1 
ce pa ae 
lex <5 
Because |x| > B°~! (since d; > 0), the relative error is bounded by 


le| < eps (1.3) 


with the computer’s relative precision eps defined as 


eps = ae (1.4) 


From B = 2 and n = 53 it follows that eps ~ 10~1°, thus in double precision approximately 16 
decimal digits are used. 


Figure 1.1 shows the distribution of the floating point numbers 0.1dd3 - B® ; e = —1,0,1,2 in 
base 2 (binary numbers). These floating point numbers are not uniformly distributed and there is 
a gap around 0. A computational result lying within this neighborhood is called underflow. Most 
machines provide a warning, replace the result with 0 and continue. A computational result with 
absolute value larger than the largest floating point number that can be represented is called 
overflow. The machine warns and may continue with Inf’s (infinity). 


A final remark is that rounding errors are deterministic. If one repeats the algorithm, the same 
results are obtained. 


| 
| 
-4 -3 -2 -1 0 1 2 3 4 


Figure 1.1: Distribution of +0.1d2d3 - B°, B = 2,e = —1,0,1,2. 


Next, let us investigate how computers execute arithmetic operations in floating point arithmetic. 
Processors are very complex and usually the following model is used to simulate reality. Let o 
denote an arithmetic operation (+,—, x or /) and let x and y be floating point numbers (hence, 
x = fl(x), y = fl(y)). Then the machine result of the operation x 0 y will be 


z= fl(xoy). (1.5) 


The exact result of x o y will not be a floating point number in general, hence an error results. 
From formula (1.2) it follows that 


z={xoyh}(1+¢6), (1.6) 


for some ¢ satisfying (1.3) and z # 0. Expression (1.6) describes the error due to converting the 
result of an exact calculation to floating point form. 
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Next, we suppose that x and y are no floating point numbers: the corresponding floating point 
numbers fl(x) and fl(y) satisfy fl(x) = x(1+ 1), fl(y) = y(1+€2). Again, x o y should 
be calculated. Using the floating point numbers in this calculation, the absolute error in the 
calculated result f1(f1(x) o fl(y)) satisfies: 


|xoy— fIfI(x) 0 fl(y))| < |xoy— fI(x) 0 fI(y)| + [FI (x) 0 fI(y) — FICFI(x) 0 fl(y))|- 1-7) 


From this expression it follows that the error consists of two terms. The first term is caused by 
an error in the data and the second one by converting the result of an exact calculation to floating 
point form (as in equation (1.6)). 


A few examples are provided to show how rounding errors may affect the result of a calcula- 
tion. Furthermore, general computational rules will be presented regarding the propagation of 
rounding errors. 


Example 1.4.1 (Absolute and relative errors) 


In this example, x = 5/7 and y = 1/3 are taken, and the calculations are computed on a system 
that uses 6 = 10 and a precision of 5 digits. In Table 1.1 the results of various calculations applied 
to fl(x) = 0.71429 x 10° and fl(y) = 0.33333 x 10° can be inspected. 


Table 1.1: Absolute and relative error for various calculations. 


relative error 


0.182 x 10~ 
0.38096 x 10° 0.762 x 107° | 0.200 x 1074 


0.23809 x 10° 0.524 x 107-5 | 0.220 x 1074 
0.21429 x 10! 0.429 x 10-* | 0.200 x 1074 


In order to show how the values in de table are computed, the addition is expressed as 
f(x) + fl(y) = (0.71429 + 0.33333) x 10° = 0.1047620000... x 101. 
This result has to be rounded to 5 digits: 


fICfU(x) + fl(y)) = 0.10476 x 101. 
The exact value is x + y = 22/21 = 1.0476190476.... Therefore, the absolute error is 


1.0476190476 ...— 0.10476 x 10! = 0.190 x 1074, 


and the relative error is 


1.0476190476...— 0.10476 x 10! 


~ 4 
1.0476190476... ~ 0.182 x 10°". 


The error analysis of the other three operations follows the same strategy. 
Example 1.4.2 (Absolute and relative errors) 


In this example, the same numbers x and y and the same precision as in the previous example are 
used. Furthermore, u = 0.714251, v = 98765.1 and w = 0.111111 x 10-4, hence fl(u) = 0.71425, 
fl(v) = 0.98765 x 10° and w = 0.11111 x 10-4. These numbers are chosen in such a way that 
rounding error problems can be clearly illustrated. In Table 1.2, x — u has a small absolute error 
but a large relative error. By dividing x — u by a small number w or multiplying it with a large 
number v, the absolute error increases, whereas the relative error is not affected. On the other 
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Table 1.2: Absolute and relative error for various calculations. 


operation result exact value | absolute error | relative error 
4 4 


0.40000 x 10-4 | 0.34714 x 10~ 
0.36000 x 10! | 0.31243 x 10! 


0.39506 x 10! | 0.34286 x 10! 
0.98766 x 10° | 0.98766 x 10° 


hand, adding a larger number u to a small number ?v results in a large absolute error but only a 
small relative error. 


The first row in Table 1.2 is created using the exact results u = 0.714251 and 
x —u =5/7— 0.714251 = 0.3471428571... x 107+. 
The computed results, however, are fl(u) = 0.71425 x 10° and 
f(x) — fl(w) = 0.71429 — 0.71425 = 0.0000400000 x 10°. 
Normalization yields f1(fl(x) — fl(u)) = 0.40000 x 10-4 leading to the absolute error 
(x —u) — fI(fl(x) — fl(u)) = (0.3471428571 ...— 0.40000) x 10-4 = 0.528 x 107°. 


The relative error is 
0.52857... x 10-5 


—______—. = 0.152. 
0.34714...x 10-4 


It is interesting to note that the large relative error has nothing to do with the limitations of the 
floating point system (the subtraction of f1(x) and fl(u) is without error in this case) but is only 
due to the fact that the data are represented in no more than 5 decimal digits. The zeros that 
remain after normalization in the single precision result f/(fl(x) — fl(u)) = 0.40000 have no 
significance, only the digit 4 is significant; the zeros that are substituted are a mere formality 
and represent no information. This phenomenon is called loss of significant digits. The loss of 
significant digits has a large impact on the relative error, because of division by the small result. 


A large relative error will sooner or later have some unpleasant consequences in later stages of 
the process, also for the absolute error. For example, if x — u is multiplied by a large number, then 
a large absolute error is generated, together with the already existing large relative error. This can 
be seen in the third row of the table, where the exact result is (x — u) x v = 3.42855990000460.... 
Calculating fl(fl(x) — fl(u)) x fl(v) yields 


fI(fl(x) — fl(u)) x fl(v) = 0.4 x 10~* x 0.98765 x 10° = 0.3950600000 x 101. 


After rounding, the computed result is fI(fl(fl(x) — fl(u)) x fl(v)) = 0.39506 x 10!. This yields 
the absolute error 
3.42855990000460...— 0.39506 x 10! = 0.522, 
and the relative error is 
0.522... 
3.42855990... 


Suppose that y? is added to the result (x — u) x v. Because y = 1/3 and therefore y* = 1/9, the 
result of this operation is indistinguishable, due to the large absolute error. In other words, for 
the reliability of the result it does not make a difference if the last operation had been omitted 
(and the numerical process was altered). Therefore, something is fundamentally wrong in this 
case. 


= 0.152. 
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Almost all numerical processes exhibit loss of significant digits for a certain set of input data, 
one might call such a set ill conditioned. There are also numerical processes that exhibit these 
phenomena for all possible input data. Such processes are called unstable. One of the objectives 
of numerical analysis is to identify unstable processes and classify them as useless, or improve 
them in such a way that they become stable. 


Remark 
The standard mathematical laws (such as the commutative laws, associative laws, and the dis- 
tributive law) do not necessarily hold for floating point arithmetic. 


Computational rules for error propagation 

In the analysis of a complete numerical process, the accumulated error of all previous steps 
should be interpreted as a perturbation of the original data. Moreover, in the result of the sub- 
sequent step, the propagation of these perturbations should be taken into account together with 
the floating point error. In general, this error source will be more important than the floating 
point error after a considerable number of steps (in the previous example of (x — u) x v even 
after two steps!). In that stage the error in a numerical process will be largely determined by the 
‘propagation’ of the accumulated errors. The computational rules to calculate numerical error 
propagation are the same as those to calculate propagation of error in measurements in physical 
experiments. There are two rules: one for addition and subtraction and one for multiplication 
and division. 


The approximations of x and y will be denoted by * and # and the (absolute) perturbations by 
6x =x—KXand dy =y-— jf. 


a) For addition, the error is (x + y) — (+ 7) = (x -— £)+ (y— J) = 6x + oy. In other words, 
the absolute error in the sum of two perturbed terms is equal to the sum of the absolute 
perturbations. A similar rule holds for differences: (x — y) — (® — 7) = 6x — dy. The rule is 
often presented in the form of an inequality: |(x + y) — (£7)| < |éx| + |dy|. We call this 
inequality an error estimate. 


b 


~~ 


This rule does not hold for multiplication and division. Efforts to derive a rule for the 
absolute error will lead nowhere, but one may derive a similar rule for the relative error. The 
relative perturbations ¢, and ey, are defined as ¥ = x(1 +), and similarly for y. For the 
relative error in a product xy it holds that 


xy— GH — xy—x(1t+ex)y(tey) _ 
xy xy = 
assuming €, and ¢, are negligible compared to 1. In words: the relative error in a product of 
two perturbed factors is approximately equal to the sum of the two relative perturbations. 
A similar rule can be derived for division, where 
[xy = xy 
— < Jex| + ley]. 
Ixy| oan, 


—(€x + éy + Ex€y) we —(€y + ey), 


Identification of ¥ with f(x) and 7 with fl(y) makes it possible to explain various phenomena 
in floating point computations, using these two simple rules. 


1.5 Landau’s O-symbol 


In the analysis of numerical methods estimating the behavior of the error is of primary impor- 
tance. However, it is often more important to know the growth rate of the error rather than its 
explicit value which is impossible to compute in many practical problems. Landau’s O-symbol 
is a handy tool to compare the growth rate of two functions in the neighborhood of a point a. 
For all applications of the Landau O-symbol in this book, a = 0, which leads to the following 
(simplified) definition. 
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Definition 1.5.1 (O-symbol) Let f and g be given functions. Then f(x) = O(g(x)) for x — 0, if there 
exist positive r and finite M such that 


lf (x)| < M|g(x)| forall x € [-r,7]. 


To estimate errors, the following computational rules are used. 


Computational rules 
If f(x) = O(x?P) and g(x) = O(x1) for x > 0, with p > 0 and q > 0, then 


a) f(x) = O(x°) for alls withO <s < p. 

b) af (x) + Bg(x) = O(x™intP-4}) for all a, B € R. 
c) f(x)g(x) = O(xP*4). 

d) f(x)/|x|° = O(xP™) if0 <5 < p. 


Example 1.5.1 (O-symbol) 
Let f(x) = x* and g(x) = x. In this case: 
a) f(x) = O(x?) for x > 0, but also f(x) = O(x) for x > 0. 
b) Function h(x) = x* + 3x is O(x) for x + 0 but not O(x?) for x > 0. 
c) f(x) - g(x) is O(x3) for x > 0. 
d) f(x)/g(x) = xis O(x) for x > 0. 


1.6 Some important concepts and theorems from analysis 


In this section some important concepts and theorems from analysis are reviewed that are often 
used in numerical analysis. Note that C[a,b] is the set of all functions being continuous on the 
interval [a,b] and C? [a,b] is the set of all functions of which all derivatives up to the pth exist and 
are continuous on the interval (a, b). 


Definition 1.6.1 (Well posedness) Mathematical models of physical phenomena are called well posed 
if 

e A solution exists; 

e The solution is unique; 

e The solution’s behavior changes continuously with the data. 


Models that do not satisfy these conditions are called ill posed. 


Definition 1.6.2 (Well conditioned) Mathematical models are called well conditioned if small errors 
in the initial data lead to small errors in the results. Models that do not satisfy this condition are called ill 
conditioned. 


Definition 1.6.3 (Modulus of a complex number) The modulus of a complex number a + ib is equal 


to |a + ib| = Va + b?. Note that also |a — ib| = Va? + b?. 


Definition 1.6.4 (Kronecker delta) The Kronecker delta is defined as the following function of two in- 


teger variables: 
eS eee et 
y= 40 ig! 
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Definition 1.6.5 (Eigenvalues) Let A be an m x m matrix, and I be the m x m identity matrix. The 
eigenvalues \1,...,Am of A satisfy det(A — AI) = 0. 


Theorem 1.6.1 (Triangle inequality with absolute value) For any x,y € R, |x+y| < |x| + yl. 
Furthermore, for any function f : (a,b) > R, | f? F(xdax| < ft |f (x) |dx. 


Theorem 1.6.2 (Intermediate-value theorem) Assume f € C[a,b]. Let f(a) € f(b) and let F bea 
number between f(a) and f(b). Then there exists a number c € (a,b) such that f(c) = F. 


Theorem 1.6.3 (Rolle’s theorem) Assume f € C[a,b| and f differentiable on (a,b). If f(a) = f(b), 
then there exists a number c € (a,b) such that f'(c) =0. 


Theorem 1.6.4 (Mean-value theorem) Assume f € C[a,b] and f differentiable on (a,b), then there 
exists a number c € (a,b) such that 
f(b) = fla), 


fi 


Theorem 1.6.5 (Mean-value theorem for integration) Assume G € C[a,b| and ¢ is an integrable 
function that does not change sign on [a,b]. Then there exists a number x € (a,b) such that 


Theorem 1.6.6 (Weierstrass approximation theorem) Suppose f € C|a,b] is given. For every e > 0, 
there exists a polynomial function p such that for all x in |a, b], |f (x) — p(x)| <. 


Theorem 1.6.7 (Taylor polynomial) Assume f € C"+!(a,b) is given. Then for all c,x € (a,b) there 
exists a number ¢ between c and x such that 

f(x) = Pax) + R(x), 
in which the Taylor polynomial P,(x) is given by 


SE 2 
Pa(x) = fle) +(x o)f'(e) + SD pre) +... 4 SEP 


and the remainder term Ry (x) is: 


_ 7\ntl 
Ry(x) = Ff, 
Proof 
Take c,x € (a,b) with c 4 x and let K be defined as 
So A\2 n\n 
fl) = f+ e-of 9+ BP po +.. + PV pos Kary. a 
Consider the function 
(x1)? (x—#)" 


FOS{O=(Mt eof Oss sf Ott 7 skeen 


By (1.8) it follows that F(c) = 0 and F(x) = 0. Hence, by Rolle’s theorem there exists a number ¢ 
between c and x such that F’(¢) = 0. Further elaboration yields 


PO =£O+"'@e-9-F@O)+{GPe-o-soe-ah +. 


(n+1) (n) 
4 {2 +1 (¢) (x—@)" - id 9) aa] — K(n+1)(x—@)" 


n! 


(n41) 
= PO ey Kin t 1)(x-2)" =0. 


n! 
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Hence, 
pede) 
(n+1)!’ 
which proves the theorem. xX 


Theorem 1.6.8 (Taylor polynomial of two variables) Let f : D C R* +> R be continuous with 
continuous partial derivatives up to and including order n+ 1 ina ball B C D with center c = (cy,c2) 


and radius p: B = {(x1,X2) € IR? : \/(x1 — cy)? + (x2 — c2)2 < p}. Then for each x = (x1,x2) € B 


there exists a 0 € (0,1), such that 
F(x) = Pa(x) + Rn(x), 


in which the Taylor polynomial Py(x) is given by 


2 2, 2 n 
wal sy > sare a (x. = Cj, (Xe = Gi) ies (.. me a 


Ox}, OX;, wee OXi, 


2 2 gutl 
Ri(x)= Gil yi Deke te ee ie aye ~ bina) ge (+ (x0), 
“ih a n+1 


Proof 
Let x € B be fixed, and h = x — c, hence ||h|| < p. Define the function F : (—1,1) + R by: 


F(s) = f(c+sh). 
Because of the differentiability conditions satisfied by f in the ball B, F € C"*1(—1,1) and F¥(s) 
is given by (verify) 


ae SOFC SI) is es 
- OX}, OX}... OX, sia ala 


Expand F into a Taylor polynomial about 0. This yields 


, gt 2 gttl pai 
F(s) = F(0)+sF (0) +...+ TF Gene (0s), 
for some 6 € (0,1). By substituting s = 1 into this expression and into the expressions for the 
derivatives of F, the result follows. xX 


Example 1.6.1 (Taylor polynomial of two variables) 
For n = 1, the Taylor polynomial equals 


P(x) = Ferien) + (a1 = er) $E(ersc0) + (2 ~ 62) 5 (102), 


and for the remainder term: Rj (x) is O(||x — ¢||*). 
Theorem 1.6.9 (Power series of 1/(1—x)) Let x € Rwith |x| < 1. Then: 


1 2 443 Uk 
7 ear ee +x Pees 
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Theorem 1.6.10 (Power series of e*) Let x € R. Then: 


1.7. Summary 


In this chapter, the following subjects have been discussed: 


- Numerical mathematics; 

- Rounding errors; 

- Landau’s O-symbol; 

- Some important concepts and theorems from analysis. 


1.8 Exercises 


1. Let f(x) = x°. Determine the second-order Taylor polynomial of f about the point x = 1. 
Compute the value of this polynomial in x = 0.5. Give an upper bound for the error and 
compare this with the actual error. 


2. Let f(x) = e*. Give the nth order Taylor polynomial about the point x = 0, and the remain- 
der term. How large should n be chosen in order to make the error less than 10~® in the 
interval [0, 0.5]? 


3. We use the polynomial Pj(x) = 1 —.x*/2 to approximate f(x) = cosx in the interval 
[—1/2, 1/2]. Give an upper bound for the error in this approximation. 


4. Letx = 1/3, y = 5/7. We calculate with a precision of three (decimal) digits. Express x 
and y as floating point numbers. Compute fl(fl(x) o fl(y)), xo y and the rounding error 
taking o = +, —, *, / respectively. 


5. Write a program to determine the machine precision of your system. 
Hint: start with p = 1. Divide p repeatedly by a factor 2 until the test 1+ = 1 holds. The 
resulting value is close to the machine precision of your system. 


Chapter 2 


Interpolation 


2.1 Introduction 


In practical applications it is frequently the case that only a limited amount of measurement data 
is available, from which intermediate values should be determined (interpolation) or values out- 
side the range of measurements should be predicted (extrapolation). An example is the number 
of cars in The Netherlands, which is tabulated in Table 2.1 (per 1000 citizens) from 1990 every fifth 
year up to 2010. How can these numbers be used to estimate the number of cars in intermediate 
years, e.g. in 2008, or to predict the number in the year 2020? 


Table 2.1: Number of cars (per 1000 citizens) in The Netherlands (source: Centraal Bureau voor 
de Statistiek). 


ear 1990 | 1995 | 2000 | 2005 | 2010 
MA 


344 | 362 | 400 [ 429 | 460 


In this chapter several interpolation and extrapolation methods will be considered. 


A second example of the use of interpolation techniques is image visualization. By only storing a 
limited number of pixels, much memory can be saved. In that case, an interpolation curve should 
be constructed in order to render a more or less realistic image on the screen. 


A final application is the computation of trigonometric function values on a computer, which is 
time consuming. However, if a number of precalculated function values is stored in memory, the 
values at intermediate points can be determined from these in a cheap way. 


2.2 Linear interpolation 


The simplest way to interpolate is zeroth degree (constant) interpolation. Suppose the function 
value at a certain point is known. Then, an approximation in the neighborhood of this point is 
set equal to this known value. A well known example is the prediction that tomorrow’s weather 
will be the same as today’s. This prediction appears to be correct in 80% of all cases (and in the 
Sahara this percentage is even higher). 


A better way of interpolation is a straight line between two points (see Figure 2.1). Suppose that 
the function values f(x9) and f(x) at the points x9 and x; are known (x9 < xj). In that case, 
it seems plausible to take the linear function (a straight line) through (x9, f(x0)), (x1, f(x1)), in 
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0. 
35 1 15 
xo x x1 


Figure 2.1: Linear interpolation polynomial, L;, of the function f (x) = 1/x, x9 = 0.6, x = 0.95, 
xy = 1.3. 


order to approximate! f(x),x € [xo, x1]. Such a function is called a linear interpolation polynomial, 
and can be expressed as 


xX — X09 


Ly (x) = f(x0) + (fF (x1) — f(x0)). (2.1) 


X1 — XQ 


The mathematical derivation of this formula uses the following properties: 
1. Ly is linear, hence it can be written as Ly (x) = co + ¢1%; 


2. L; interpolates the given data, hence it holds that Ly(xo) = f(x9) and L;(x1) = f (x1). This 
is called the interpolation property. 


These properties lead to the following linear system of equations: 


(2) (3) = Ged): 7 


from which the unknown coefficients cg and c; can be determined. 


The need to solve (2.2) can be overcome by making use of the so-called Lagrange basis polynomials 
Li1(x),i = 0,1. Loi (x) and L1 (x) are defined as linear polynomials such that 


Loi (x0) = 1, Loi (a1) = 0, Lii(%o) = 0, Li (m1) = 1. 
From these properties it follows that 


x-—X x— Xo 


Loi(x) = 


d L a : 
xx 1%) x1 — Xo 


The linear interpolation polynomial L(x) can be written as 


L(x) = Loi(x) f (xo) + Li (x) f (x1), 


lWhenever the word ‘approximation’ occurs in this chapter, we mean that the unknown function value is approxi- 
mated using an interpolation polynomial. 
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which is called the Lagrange form of the interpolation polynomial, or the Lagrange interpolation 
polynomial for short. The interpolation points x9 and x, are often called nodes. 


Theorem 2.2.1 provides a statement for the interpolation error if the interpolated function is at 
least twice continuously differentiable. 


Theorem 2.2.1 Let xq and x, be distinct nodes in [a,b], x9 < x1, f € C*{a,b] and let Ly be the linear 
interpolation polynomial that equals f at the nodes xo, x. Then for each x € [a,b] there exists a & € (a,b) 
such that 


f(x) —La(x) = 5(a— x0) (e- ) F"). 23) 


Proof 
If x = x9 or x = x4, then f(x) — L1(x) = 0 and é can be chosen arbitrarily. Assume x # xo and 
x # x1. For each x there exists a constant q such that 


f(x) — Lie) = q(x — x0) (2 — x1). 


To find an expression for q consider the function 


p(t) = f(t) — La(t) — q(t — x0) (t — 21). 
¢ satisfies p(xo) = p(x1) = v(x) = 0. By Rolle’s theorem, there exist at least two points y and z 
in (a,b) such that g'(y) = g'(z) = 0. Again by Rolle’s theorem there is a € between y and z and 
hence € € (a,b) such that g"’(€) = 0. Because g(t) = f’"(t) — 2q this means that q = 3 f"(@). & 
If x ¢ [xo, x1], then the linear polynomial (2.1) can be used to extrapolate. Relation (2.3) is still the 
correct expression for the error. 


From Theorem 2.2.1, an upper bound for the linear interpolation error follows: 
1 
F(x) = L(x) < go — x0)” max |f"(2)|- (2.4) 


In many practical applications the values f (xo) and f(x,) are a result of measurements (per- 
turbed) or calculations (round-off errors). Assume that the measurement error is bounded by 
e, that is, | f (x9) — f(xo)| < ¢ and |f (x1) — f(x1)| < ¢, where the* (hat) refers to the measured, 
hence available data. The difference between the exact linear interpolation polynomial L; and 
the perturbed polynomial 1 is bounded by 


br = x1 + [e201 


L —f < 
Ita(x) ~ La(x)| < 


(2.5) 
For x € [x9,X] (interpolation), the error resulting from uncertainties in the measurements is 
bounded by «. 


However, if x ¢ [x9,x1] (extrapolation), the error can grow to arbitrarily large values when 
moving away from the interval [x9,x1]. Suppose that x > x1, then the additional inaccuracy 
is bounded by 


|L1(x) — £4(x)| < (1+22 ae (2.6) 
x1 — Xo 

In Figure 2.2, the impact of the rounding or measurement errors on linear interpolation and ex- 

trapolation is visualized. It graphically illustrates the danger of extrapolation, where the region 

of uncertainty, and hence, the errors due to uncertainties in the data, make the extrapolation error 

large. Note that the truncation error in equation (2.3) also becomes arbitrarily large outside the 


interval [x9, x1]. 
The total error is the sum of the interpolation/extrapolation error and the measurement error. 
Example 2.2.1 (Linear interpolation) 


In Table 2.2, the value of the sine function is presented for 36° and 38°. The approximation for 
37° using the linear interpolation polynomial provides a result of 0.601723. The difference with 
the exact value is only 0.9 x 1074. 
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— Exact 
Lower limit 


- - - Upper limit 


-1 -0.5 0 0.5 1 


Figure 2.2: Interpolation and extrapolation errors resulting from measurement errors, using, as 
an example, x9 = 0,x1 = 1, f (xo) = 1, f(x1) = 2, and ¢ = 1/2. The region of uncertainty is filled. 
Table 2.2: The value of sin w. 


sin a 
36° | 0.58778525 


37° | 0.60181502 
38° | 0.61566148 


2.3 Higher-order Lagrange interpolation 


Suppose that n + 1 function values are provided. Instead of using linear interpolation, f(x) can 
be approximated by a polynomial L,(x) of degree at most n, such that the values of f and Ly 
at the n + 1 different points x9,...,Xn coincide. Ly is called an nth degree Lagrange interpolation 
polynomial. 


The polynomial L,,(x) that satisfies these constraints has the following properties: 


1. Ln(x) is a polynomial of degree n, hence it can be written as 
n 
toe me cx, 
k=0 


2. The values at the interpolation points coincide: Ln(xj) =f (xj), f= 0c: 


These properties lead to the following linear system: 


I: aie UNG. ene en Co f (x0) 
il, epee’ | Sanse “a Gi f (Xn) 


from which the unknown coeffients c;,1 = 0,...,m, can be determined. This matrix, in which 
each row consists of the terms of a geometric series, is called a Vandermonde matrix. 
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It is again possible to avoid the linear system of equations by using Lagrange basis polynomials. 
To this end, the nth degree interpolation polynomial is written as 


n 


En(%) = f(a) Lin(®), (2.7) 


k=0 
in which the nth degree Lagrange basis polynomials satisfy Lkn(x;) = 44; in the nodes x; (see 
definition 1.6.4). The Lagrange basis polynomials L,.,(x) are explicitly given by 
Lag) = OTH) sea aa) ies 
(Xk — X0) + Xk — Xe-1) (Xk — Fe1) °° Lk — Xn) 


and can also be written as: 


Theorem 2.2.1 can be generalized: 


Theorem 2.3.1 Let xo,...,Xn be distinct nodes in (a, b|, x9 < x1 < +++ < Xn. Let f € C"*[a,b] and 
let Ly be the Lagrange polynomial that equals f at these nodes. Then for each x € [a,b] there exists a 
€ € (a,b) such that 


fe) 


f(x) — Ln(x) = (x — x9) +++ (4 — Xn) (n+1)!° 


Proof: 
The proof is completely analogous to that of Theorem 2.2.1. X 


If Lagrange interpolation is used on tabulated values, then the best results are obtained by choos- 
ing the nodes in such a way that x is in the (or an) innermost interval. 


Suppose that measurement errors are made, and | f(x) — f(x)| < e. Then the error in the per- 
turbed interpolation polynomial is at most 


ln (2) — bax) < Yo Len le 
k=0 


n 
If the nodes are equidistant, x, = x9 + kh, with h = (xn — xo)/n, then the value of Yo |Lyy(x)| 
k=0 


increases slowly with n. A number of upper bounds is given in Table 2.3. 


Table 2.3: Upper bounds for iy ;emncale 
k=0 


a 
2.3 
1.6 


Assume that the interpolation points are equidistantly distributed on an interval [a,b]. One might 
wonder whether it is possible to approximate a provided (sufficiently smooth) function arbitrar- 
ily close using a Lagrange interpolation polynomial, by increasing the degree n of the interpo- 
lation polynomial. Surprisingly enough this is not the case. A well known example of this was 
suggested by Runge. 
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Example 2.3.1 (Runge’s phenomenon) * 


Consider the function 1/(1 + x?) on the interval [—5,5], with nodes x, = —5 + 10k/n, where 
k = 0,...,n. In Figure 2.3(a) the function is visualized together with the 6th and 14th degree 
interpolation polynomials. Note that the 14th degree polynomial approximation is better than 
that of degree 6 on the interval [-3, 3]. In the neighborhood of the end points, however, the 14th 
degree polynomial exhibits large oscillations. 


The oscillations of the interpolation polynomial near the boundary of the interval are known as 
Runge’s phenomenon. The above example leads to the question whether Runge’s phenomenon 
can be avoided. In order to answer this question, the expression for the interpolation error in 
Theorem 2.3.1 can be considered. This expression contains two factors that can become large: 
the unknown factor f‘"+!)(@) and the polynomial factor (x — xg) «+ « (x — Xn). Clearly, f("+)) (€) 
cannot be influenced, but the polynomial (x — xg) -- - (x — Xn) can be made small in some sense 
by choosing better interpolation points x;,i = 0,...,n. More specifically, the interpolation points 
should be chosen such that 

max |(x— x9) ++ (x= xn)| 

x€ [a,b] 
is minimized over all possible choices for the interpolation points x9,...,Xn. This problem was 
solved in the 19th century by the Russian mathematician Chebyshev, who found that the optimal 
interpolation points are given by 


a+b b—-a 2k+1 
— — —— k=0,... 
XK 5 + 5 cos (a ae 5") 0,...,N, 


which are called Chebyshev nodes. In Figure 2.3(b), it can be seen that the interpolation polynomi- 
als using Chebyshev nodes provide much better approximations compared to the equidistantly 
distributed ones in Figure 2.3(a). It can be shown that by using Chebyshev nodes, a continuously 
differentiable function can be approximated with arbitrarily small error by increasing the degree 
n of the interpolation polynomial. 


Example 2.3.2 (Extrapolation) 


Large errors may also occur in extrapolation. Consider the function f(x) = 1/x. The nth degree 
interpolation polynomial is determined with nodes x, = 0.5+k/n, k = 0,...,n. In Figure 2.4, 
f(x) and the interpolation polynomials of degree 6 and 10 are shown on the interval [0.5, 1.8]. 
The polynomials and the function itself are indistinguishable on the interval [0.5, 1.5]. With 
extrapolation (x > 1.5), however, large errors occur. Again the largest error occurs in the highest- 
degree polynomial. 


2.4 Interpolation with function values and derivatives * 


2.4.1 Taylor polynomial 


A well known method to approximate a function by a polynomial is Taylor’s method. As an 
approximation method it is not used very often in numerical mathematics, but instead it is used 
quite often for the analysis of numerical processes. 


In many cases the approximation of the Taylor polynomial becomes better with increasing poly- 
nomial degree. However, this is not always true. This is shown in the following example: 


Example 2.4.1 (Taylor polynomial) 


The function f(x) = 1/x is approximated in x = 3 by a Taylor polynomial of degree n about 
x = 1 (note that x = 3 is outside the region of convergence of the Taylor series). The derivatives 
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(b) Chebyshev nodes 


Figure 2.3: Interpolation of f(x) = 1/(1 + x*) with Lagrange polynomials of degree 6 and 14. 


are given by: f")(x) = (—1)*k!x~ +), such that the nth degree Taylor polynomial about x = 1 
equals 
n n 
pa(x) = Yo fl (1)(x—1)'/ = Yo (-1)K(x — 1) 
k=0 k=0 
The values of pn(3) that should approximate f(3) = 1/3 are tabulated in Table 2.4. The approxi- 
mation becomes less accurate with increasing polynomial degree. 


Table 2.4: Approximation of f(x) = 1/x in x = 3, using an nth degree Taylor polynomial about 
x=1. 


pon [Oli [2/3{4)5 [6] 7 | 


Pa) 
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Figure 2.4: Extrapolation of f(x) = 1/x with Lagrange polynomials of degree 6 and 10. 


2.4.2 Interpolation in general 


In general, given a function f, a polynomial p is constructed such that at a number of pairwise 
distinct nodes xo,..., not only the values of f and p coincide, but also the value of their deriva- 
tives up to and including order m; at node x;, k = 0,...,n. 


Suppose that f € C”[a,b] with m = ec Then p is the polynomial of lowest degree such 
SKN 
that 
dlp 
dxi 


di 
(xp) = FZ (x) foreach k=0,...,n and j =0,...,mg. 
% 
Remarks 
n 
e This polynomial is at most of degree M = )) my +n. 
k=0 


e Ifn =0, then p is the Taylor polynomial of degree mg about xo. 


e If my =... = myn = 0, then p is the nth degree Lagrange polynomial of f in the nodes 
Ay ees Mt 
As an example, Section 2.4.3 deals with the case mg = ... = my, = 1. The resulting polynomials 


are called Hermite interpolation polynomials. 


2.4.3 Hermite interpolation 


If both function values and its first derivatives are known at certain points, these data can be used 
to construct an approximation polynomial of a higher degree. 


For instance, assume that the function values and the value of the first derivative at two different 


points are known, that is, 
(xo, f(x0)), (x1, f(x1)), 


(xo, f'(xo)), (x1, f"(%1)) 
are provided. Using these four independent values, a third degree polynomial, H3, can be con- 
structed, which satisfies 

H3(xj) = f(xj),  j=0,1, 


H3(xj) = f'(xj), 7 = 0,1. 
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Writing H3(x) = cy + c1x + c2x* + €3x° leads to the following system of linear equations: 


1 x9 x9 x) co f (x0) 

ee 2 ae Ga C1 = f(x1) (2.9) 
Od 2x Bx4 C2 f'(xo) |° , 

0 1 2x, 3x7 3 f'(x1) 


In the same spirit as for Lagrange interpolation, this polynomial can be written as 


1 
H3(x) = > (f (xe) Ha (x) +f! (xe) Ha (x)), 
k=0 

with 

Aya (xj) = dj, = Ha (xj) = 0, 

ia (xj) = 0, Hig (Xj) = Oj 
This avoids having to solve system (2.9). Polynomials based both on given function values and 
given derivatives are called Hermite interpolation polynomials. 


The general expression for Hermite interpolation polynomials containing function values and 


first derivatives is 
n n 


Anyi (x) = 2 Fe) Fen) a Le fn Hin 2), 
in which 
Hy, (x) = (1 — 2(x — xg) Ln (Xx) Ei), 
and 


Hin (x) = (x — xy) L2,, (x). 
The functions L;,(x) are the Lagrange basis polynomials, as defined in equation (2.8). 


The correctness of the Hermite interpolation polynomial is shown as follows: Hy, and Hx, are 
polynomials of degree 2n + 1. Note that Hj,(x;) = 0 if k  j, because in that case, Ly,(x;) = 0. If 
k = j, then, because Ly, (x,) = 1, 


Fyn (xj) = Hen (xe) = ((1 — 2( xe — Xk) Lig (Xe) LE (XK) = 1. 
Because H,(x;) = 0 for all j = 0,...,n it follows that H2n41(x;) = f (xj). 
Next, we have to show that the derivatives coincide at the nodes. Therefore, Hy, is calculated: 
Hin (*) = —2L yen (Xe) Lin (*) + (1 — 2(x = Xe) Lin (%k)) 2Len (2) Len (2): 
For all j = 0,...,n, Hj,,(xj) = 0. The derivative of Hy, is given by 
Hig (x) = Lp, 2) Ee 
Hence H;,,(x;) = 6), and therefore H5,,, (xj) = f’(xj)- 


Indeed, the Hermite interpolation polynomial satisfies the conditions H2n+41(x;) = f(x;) and 
Ans (Xj) = f' (4), 7 =0,.--, 0. 


Theorem 2.4.1 Let xo,...,Xn be distinct nodes in [a,b], x9 < x1 < ++: < Xn. Let f € C?"**[a,b] 
and let Hy, be the Hermite interpolation polynomial that equals f and f' at these nodes. Then for each 
x € [a,b] there exists a & € (a,b) such that 


ame CN eee eee oO 
Flt) ~ Hays (x) = SAAR EAT an, 
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Proof: 
The proof of this theorem is analogous to that of Theorem 2.2.1. The auxiliary function is chosen 
as follows: 


(t —x9)*---(f-2xn)* 
= f(t)-—H t) -— ———_———_—_——— — H ; 
p(t) = f(t) — Honsi(t) Gan Goae 2n+1(x)) 
It can be proved that g’(t) has 2n + 2 different zeros in (a,b). xX 
The choice for Hermite interpolation instead of Lagrange interpolation is clarified in the follow- 
ing two examples. 


Example 2.4.2 (Seismic waves) 


Seismic waves, the same type of waves used to study earthquakes, are also used to explore deep 
underground for reservoirs of oil and natural gas. A simple model for wave propagation is 
described by the following set of ordinary differential equations: 


ig Pa. : 

tt = csiné@, 
dz _ _ 

7 = ccos 0, 
do _ _dcg; 

at = dz sin é. 


The position is denoted by (x,z) while @ is the angle between wave front and x-axis. Suppose 
that the propagation speed c depends on the vertical position only and that it is known at a 
finite number of positions. However, in order to solve this system an approximation of c(z) in 
all intermediate points is required. If piecewise linear interpolation is used, the derivative c’(z) 
does not exist in the nodes, which may lead to large errors in the solution. If both c and c’(z) are 
known in the nodes, then the third degree Hermite interpolation polynomial can be used in each 
interval and the first derivative exists also at the nodes. 


Example 2.4.3 (Visualization) 


Suppose a finite number of points of a figure is known and a smooth curve should be drawn 
through them. Because a piecewise linear interpolation polynomial is not smooth in the nodes, 
this often leads to an unrealistic representation. A better result is obtained using Hermite inter- 
polation. 


Suppose the graph of the function f(x) = 1/(1+x°), x € [0,4], is used to visualize half of a 
symmetric hill. In Figure 2.5 the graph is approximated with several interpolation polynomials. 
The resulting curve using piecewise linear polynomials does not resemble a hill even remotely. 
Hermite interpolation on the same nodes provides a much better result. However, this is not an 
honest comparison because third degree Hermite interpolation uses twice as many input values. 
Therefore, the third degree Lagrange interpolation on the intervals [0,2] and [2,4] is considered. 
Note that on the interval [2,4] the function and the polynomial almost coincide. However, due to 
the large jump in the derivative at x = 2, this result is useless. 


2.5 Interpolation with splines 


In the previous sections, it has been demonstrated that the approximation of a function using an 
interpolation polynomial can cause some problems. It is often better to divide the interpolation 
interval into subintervals and to construct an interpolation polynomial on each subinterval. One 
problem with this approach is the lack of smoothness of the interpolation polynomial at the inter- 
face of two subintervals. Hermite interpolation is a remedy for this problem, but for this to work 
one has to know the derivative at the interface. If the data are gathered from measurements, then 
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0.8; 


0.47 


0.2¢ 


Figure 2.5: Interpolation of f(x) = 1/(1 +x) using two intervals: linear, Hermite (degree 3) and 
Lagrange (degree 3) interpolation. 


the derivative at the nodes are often unknown. One way to ensure smoothness without knowing 
the derivatives is to use so-called splines. 


A spline is a piecewise polynomial that is connected smoothly in the nodes. We define the nodes 
a =Xq <X1 <+++ < xX, = b, which partition the interval [a,b]. A spline of degree p is defined as 
the piecewise function 

so(x), x € [xo, x1], 

(x), x & [x1 xa], 


s(x) = 
Sy_1(X), xX © [X_-1, Xn], 


where 5; (X)|{x,,x,,,] 18 a polynomial of degree p and a smooth connection of s; and its derivatives, 
ee sP-)) is required in the nodes. 
In this section, only linear (p = 1) and cubic (p = 3) interpolation splines are considered. 


Definition 2.5.1 (Spline of degree 1) Let xo,...,Xn be distinct nodes in [a,b], x9 < x1 <-+++ < Xn. 
For f € Cla, b], the interpolation spline s of degree 1 is a function s € Cla, b] such that 


a. On each subinterval |x, X44], § is linear, k = 0,...,n—1, 
b. The values at the nodes satisfy s(x~) = f (xx), fork =0,...,n. 


Note that an interpolation spline of degree 1 consists of piecewise linear interpolation polynomi- 
als. 

A spline of degree 3, also called a cubic spline, consists of polynomials of degree 3. In the nodes 
the value is equal to the function value, and further, the first and second derivatives are continu- 
ous. Splines were originally developed for shipbuilding in the days before computer modeling. 
Naval architects needed a way to draw a smooth curve through a set of points. The solution 
was to place metal weights (called knots or ducks) at the control points and bend a thin metal or 
wooden beam (called a spline) through the weights. The physics of the bending spline meant that 
the influence of each weight was largest at the point of contact and decreased smoothly further 
along the spline. To get more control over a certain region of the spline the draftsman simply 
added more weights. 
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Definition 2.5.2 (Natural spline of degree 3) Let the nodes xq,...,Xy be distinct in |a,b] and satisfy 
Xo <Xy < +++ < Xp. Fora function f € C{a,b], the interpolating spline s of degree 3 has the following 
properties: 


a. Oneach subinterval |x, X¢41], 8 is a third degree polynomial s, ,k =0,...,n—1, 
b. The values at the nodes satisfy s(x.) = f (xx), fork =0,...,n, 


C. Se(XK41) = Sey1 (X41), k =0,...,n — 2, 


Si (Xe41) = Sp44 (X41), k =0,...,n— 2, 
Se (Xn41) = 8¢44(%n41),k = 0,...,n-2. 


In order to uniquely determine the unknowns, two extra conditions are required. A viable choice is to 
prescribe 


d. sy (xo) = 8))_4(x%n) = 90, 
which implies that the curvature vanishes at the boundary. 


Note that s € C?[a,b]. The conditions under d could be replaced with other conditions, but these 
so-called natural boundary conditions belong to the shipbuilding application. 


In order to demonstrate how to determine such an interpolating spline, s; is expressed as 


s(x) =a (x — xy)? tho my Pam) ede k=O n= 1. (2.10) 


Define hy = xp41 — X",k = 0,...,n—1,and f, = f(x), k = 0,...,n. From property b it follows 
that 
(xz) = de = fr, k=0).22,0— 1; (2.11) 


Next, the various conditions from c are used. 
1. Second derivatives: si (x41) = si4(*k41) 


From (2.10) it follows that s/(x) = 6a,(x — x) + 2b,, such that 
6a;hy + 2b = Se (Xk41) = Sear (Xk41) = 2be44, k=0,...,n—2. 


By defining the value by as by = s\'_,(xn)/2, the relation 6a;h, + 2b, = 2b;41 holds for 
k =n—1as well. This means that a; can be expressed in by: 


1 
a = x —(de41 — Og), kK =0,...,n-1. (2.12) 
3h 


2. Function values: s; (x41) = S¢41(%K41), together with s(xn) = f (xn) 


By defining the value d, as dy, = fn, this reduces to ayh? + bh? + crhp + dy = dey, 
k = 0,...,n—1. By substituting equations (2.11) and (2.12) for d, and ax, respectively, 
the following expression for c;, is derived: 


Cc. = oS 


Set Sep, Pet Pet p= 0, nod. (2.13) 


3. First derivatives: s; (x41) = S;44 (%k41) 


This relation yields 3a,h?. + 2byhy + Ck = Cky1,k = 0,...,n — 2. Substituting a, c, (equation 
2.13) and d, leads to 


Ing Best — by) + 2hyby + Het Sie _ py, Zhe + baa _ fa — Feat _y,  2bett + bia 
I 3 Ak 3 
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Figure 2.6: Interpolation of f(x) = 1/(1 +x?) using a cubic spline on six subintervals. 


which can be simplified to 


ind + 2(l + nea Oa + heybey2 = 3 (eg list — fee Se) | go, jn 2 
k+1 k 


Property d implies that sj (x9) = 2bp = 0, such that by = 0. Using the definition by, = s/”_,(xn)/2 
(step 1), we also find b, = 0. Therefore, the resulting system consists of m — 1 equations for n — 1 
unknowns b1,...,0;,—1: 


3 (44 ff) 
2(ho + hy) hy by ti Ig 
hy 2(hy + hp) ho bo 5 
a) 2(hn-2 + hn-1) Dy-1 3 (425 fup-Su2) 
n—-1 n—2 


From this system the values b; can be calculated, from which a; and cz, can be determined using 
equations (2.12) and (2.13) and bp = by = 0. 


Example 2.5.1 (Visualization) 


In this example, a cubic spline on six subintervals is used to approximate a symmetric hill (ex- 
ample 2.4.3). The result in Figure 2.6 is of higher quality than the Hermite and Lagrange inter- 
polation approximation in Figure 2.5. Note that for Hermite interpolation, the derivatives at the 
nodes should be known, whereas this is not required in the cubic spline approach. 


Remark 
Splines are used in CAD and experience a revival in modern numerical methods for partial dif- 
ferential equations. 
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2.6 Summary 


In this chapter, the following subjects have been discussed: 


- Linear interpolation; 
- Higher-order Lagrange interpolation; 
- Interpolation with derivatives: 


- Taylor polynomial; 
= Interpo ation in general; 
- Hermite interpolation; 


- Spline interpolation. 


2.7. Exercises 


1. (a) Prove upper bound (2.4) for linear interpolation: 


(b) Show that the difference between the exact and the perturbed linear interpolation 
polynomial is bounded by 


ber = x1 +x = x0 


=f < 
[La(x) — fa(x)| < AE 


y 


and that, for x € [xo, x1], 
|Li(x) — Li(x)| <e, 


and if x > x1, 


|Ly(x) — £4 (x)| < (1 aaa ) €. 
1 0 
2. Determine the second degree Lagrange polynomial of f(x) = 1/x using nodes xp = 2, 
x1 = 2.5 and x2 = 4. Approximate f (3) with this polynomial. 


3. Determine s(1/2), in which s is the cubic spline with nodes 0, 1 and 2, both for the function 
f(x) = sy and tor f(x) =a. 


4. The distance that a car has driven is recorded every minute. The measured data are tabu- 
lated below: 


Time fmmin [0 [ T [2 [3 


[Distance Gin km): [0 [05 [1 [| 


(a) Compute the Lagrange interpolation polynomial of degree 3. Use this polynomial to 
estimate the maximum speed of the car. 


(b) Compute the cubic spline that interpolates the above data. Use this spline to estimate 
the maximum speed of the car. 
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Numerical differentiation 


3.1 Introduction 


Everyone who possesses a car and/or a driver’s licence is familiar with speeding tickets. In 
The Netherlands, speeding tickets are usually processed in a fully automated fashion, and the 
perpetrator will receive the tickets within a couple of weeks after the offence. The Dutch police 
optimized the procedures of speed control such that this effort has become very profitable to the 
Dutch government. Various strategies for speed control are carried out by police forces, which 
are all based on the position of the vehicle at consecutive times. The actual velocity follows from 
the first-order derivative of the position of the vehicle with respect to time. Since no explicit 
formula for this position is available, the velocity can only be estimated using an approximation 
of the velocity based on several discrete vehicle positions at discrete times. This motivates the use 
of approximate derivatives, also called numerical derivatives. If the police want to know whether 
the offender drove faster before speed detection (in other words, whether the perpetrator hit the 
brakes after having seen the police patrol), or whether the driver was already accelerating, then 
they are also interested in the acceleration of the ‘bad guy’. This acceleration can be estimated 
using numerical approximations of the second-order derivative of the car position with respect 
to time. 


Since the time-interval of recording is nonzero, the velocity is not determined exactly in general. 
In this chapter, the resulting error, referred to as the truncation error, is estimated using Taylor se- 
ries. In most cases, the truncation error increases with an increasing size of the recording interval 
(Sections 3.2 and 3.4). Next to the truncation error, the measurement of the position of the vehicle 
is also prone to measurement errors. Issues that influence the results are, for example, paral- 
lax, the measurement equipment, and in some cases even the performance of the police officer 
(in car-videoing and laser control). These measurement errors provide an additional deteriora- 
tion of the approximation of the speed and acceleration. The impact of measurement errors on 
approximations of derivatives is treated in Section 3.3. 


3.2 Simple difference formulae for the first derivative 
Suppose f is a continuously differentiable function. The forward difference is defined as 


y 


Qp(ny = Leth = fe) h>0, 


in which h is called the step size. By definition, 


him fet) — FX) 
h 


h-0 


= f'(x), 
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and hence, the forward difference converges to the derivative. The truncation error is defined as 


fla +h) — fix) 


Rp(ht) = f(x) - EO) G.1) 
Theorem 3.2.1 If f € C*[x,x +h], then there exists a € € (x,x +h) such that 
h 
Re(h) = —5f"(@). (3.2) 


Proof 
Taylor expansion of f(x +h) about x yields 


2 
f(xth) = f(x) +hf' (x) + "0, with € € (x,x +h). 
Substitution into (3.1) yields 


i "(x nh? cn apie 
Ry(h) = f(x) - fb aE | zee (¢) bal De: 5 F"O, 


which completes the proof. xX 
Example 3.2.1 (Forward difference) 


In this example, the derivative of the function f(x) = —x? +x? + 2x is approximated in x = 1 
using the step size h = 0.5. In Figure 3.1, it can be seen that the approximation of the derivative 
is not yet very accurate. 


3 
—f 
--- exact derivative 
asym forward difference of” : 1 
“'"'" central difference ae 
al J 
= 
15+ | 
al | 
085 0.5 1 1.5 1.8 
x—h x x+h 
Figure 3.1: Approximation of the derivative of f(x) = —3+4x*42xinx = 1 by a forward 


difference or central difference. 


Similar to the forward difference, a backward difference can be defined as 


g, = Aw fen 36. 


The truncation error for this formula is 


R,(h) = Z 


at Ds for some 7 € (x —h,x), 
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which means that the accuracy is comparable to that of the forward difference. 


Note that the errors have opposite sign. Moreover, ¢ and y will differ, but for small h this differ- 
ence is very small. It may therefore be interesting to average these two formulae such that these 
errors almost cancel. This average is known as the central difference: 


Q,(h) 2 eS 


For this particular example, this results in a better approximation (Figure 3.1). If the third deriva- 
tive of f exists and is continuous, then the truncation error in the central difference becomes 


x — f(x - g 
Re(h) = f(x) - EBSD) Ee, 


with ¢ between x — h and x + h. To prove this, f is expanded in a Taylor polynomial about x with 
remainder term, and the intermediate-value theorem is used. Note that the error in the central- 
difference approximation converges much faster towards zero as a function of h, than the error 
of the forward or backward formula. 


Remark 

In the limit h — 0, a central difference approximation approximates the derivative more accurate. 
However, for a given h the error can still be larger than when forward or backward differences 
are used. 


3.3 Rounding errors 


Apart from truncation errors, also rounding errors play an important role in numerical differen- 
tiation. In this section, the rounding-error behavior of the central-difference formula is investi- 
gated. 


Suppose that the function values contain a rounding error of at most ¢, such that 
If(x—h) — f(x—h)| <e, and [f(x +h) — fle +h)| <e 

In that case, the rounding error of the central-difference approximation of f’(x), S-(/), is bounded 
by 
flx+h)—fle—h) _ flx+n)—fle—) 

2h 2h 
- Ufa) - fle WF +H —fle+Ml 
= wh Ss 


We note that the upper bound of S,(/) increases if h decreases. 


Sc(h) = |Qc(h) re Q-(h)| = 


=~] 


Furthermore, the truncation error of the approximation is given by 


2 
Re(ht) = f'(x) — Qelh) = -E FO), CE (Why x +), 


which will reduce if h becomes smaller. 


The total error of the approximation is equal to 
|Ec(H)| = [f"(x) — Qe(h)| = | f"(x) — Qe) + Qe(h) = Qe(h)| 
7 4 h? 
S [f/(x) — Qe(M)| + |Qc(t) — Qe(h)] = [Re(h)| + Sc(h) < Em + ; = o(h), 


in which m is an upper bound for | f’” (x)| ina sufficiently large neighborhood of x. The minimum 
value of this upper bound is p(hopt) = V9em/8, for Nopt = V/3e/m. 
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Example 3.3.1 (Rounding errors) 


In this example, the derivative of f(x) = e* is approximated in x = 1, using the central-difference 
formula and six digits. In that case, the used function values, f(x), contain rounding errors (Ta- 
ble 3.1), whose absolute value, | f(x) — f(x)|, is at most 5- 10~°. The corresponding approxima- 
tions, Q.(h), are presented in Table 3.2 together with the errors f/(1) — Q-(i). 


Table 3.1: Value of e*, six digits. Table 3.2: Results of the central-difference formula. 


Ec(h) = f’(1) — Oe(h) 
2.73645 0.01817 
2.72285 -0.00457 
2.71850 -0.00022 
2.72000 -0.00172 


2.70000 0.01828 


In Table 3.2, it can be seen that for decreasing h, the approximation becomes better at first but 
it deteriorates rapidly for very small h, which is caused by the rounding errors in the function 
values. 


For h < 0.1, we know that m = max{ f’”"(€),€ € (1—h,1+h)} =e. The function g(h) and the 
upper bounds of |R-(/)| (equal to h*m/6) and S.(h) (equal to ¢/h) that belong to this example 
are drawn in Figure 3.2. Indeed, it can be seen that, for decreasing h, the upper bound of the 
truncation error decreases, whereas the rounding-error bound increases. It is easy to verify that 
p(h) is minimal for hopt = 0.0171, with p(hopt) = 0.00044. Note that g(/1) is almost constant in 
the neighborhood of Mopt (in particular for h > hopt), hence a step size close to Nop will lead to a 
comparable accuracy. 


Remarks 


e If the approximation of the derivative is insufficiently accurate, then there are two possible 
approaches to improve this: 


- Decrease the rounding and/or measurement error in the function values; 


- Use a higher-order difference formula. In that case the truncation error, R(/), is often 
smaller compared to a lower-order formula. Note that h should not be too small, in or- 
der to avoid that the loss in significant digits due to the rounding error will annihilate 
the improvement. 


e Difference formulae are often used to solve differential equations (Chapter 7). In that case 
the solution to the differential equation has to be determined accurately rather than the 
derivatives. The behavior with respect to rounding errors is often much better in that case. 


e The effect of rounding errors is large when the error in the function values is large and if 
h is small. This often happens with inherently imprecise measurements. In a computer 
generated table, the rounding error is often much smaller than in a table generated by 
measurements, but still noticeable for a small h. 


e For h < hopt, the measurement or rounding error may even explode when h — 0. 
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error 


Figure 3.2: Upper bounds of absolute truncation error, rounding error, and total error, using 
central differences in approximating the derivative of f(x) = e* inx = 1. 


3.4 General difference formulae for the first derivative 


In the previous sections, some simple difference formulae to approximate the first derivative 
have been introduced. In many practical applications, it is necessary to use difference formulae 
of higher accuracy (for example to counteract rounding errors) or formulae with non-equidistant 
nodes. In this section, a general method to derive difference formulae will be presented. This is 
done in a way, such that the order of the truncation error, given by 


[f'(x) — Q(h)| = O(h?), 
is as high as possible. 


Assume that f’(x) is approximated by a difference formula based on nodes x; = x + ih, h > 0. 
The general form of Q(h) is given by 


Q(h) = : Yaifi, in which fi = f(x). (3.3) 


Since the derivative of f(x) is defined as 


F') = fim LEM £09), 


it is natural to include the division by h in equation (3.3). The second-order derivative f’’(x) 
equals 
vf Gee Ih) =f (a) 
dT = 1 f (x , 
a eee h 
and hence one more division by h follows. Repeating this argument for higher-order derivatives 
motivates the following rule of thumb: 


Rule of thumb 1 
In a difference formula for the kth derivative, the coefficients have to be divided by hk. 


In order to determine coefficients «;, a Taylor expansion of f; is computed about x in the variable 
h. The coefficients are determined such that the order of the error is maximized. Two examples 
of this idea will be provided using central or one-sided difference formulae. 
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Example 3.4.1 (Central difference) 


In this example, x_1 = x —h, x9 = x and x; = x +h are used, such that 


a_if(x—h) +aof(x) +arf(x +h) 


Q(h) = / : (3.4) 
Taylor expansion yields 
3 
fe—h) = fle) — hfe) +S 9"e) — FM) + OW), 
f(x) = 0), 


2 3 
f(x +h) = f(x) thf) + Spey +E m1 


Substituting these expansions in equation (3.4) yields 


62 (f(x) — Af! (x) +h? /2f" (x) — 8/316" (x) + O(h*)) 
h 
4 Mof(x) tar (F(x) +f (x) HE /2f" (x) +P /31f" (x) + OC) 
h 
= f(x)(a_y/h+ag/h+a,/h) + f'(x)(—a_1+ 41) 
+ f(x) (ha_1/2 + hoy /2) + fl" (x) (—h?a_1/3! + h? 01/3!) + O(n). 


(x) +O(h*), 


Since Q(h) should approximate the first derivative of f, coefficients a_1,a9, and a, are deter- 
mined by the conditions 


FES ato} + 4 = @ 
f'(x): —H1 + a = 1, 
Fea sha_y + she, = 0. 
Solving this system leads to a_; = —1/2,a) = 0 and a; = 1/2, which yields the central- 
difference formula 
x+h)—f(x-h 
Q(h) = pA a A) t ( ) (3.5) 


with truncation error O(h?). 


Example 3.4.2 (One-sided difference formula of order 0(h7)) 


Suppose a derivative has to be determined at the left boundary of an interval, with a truncation 
error of order O(h?). Choosing xo = x,x1 =x +h, x2 =x+2hand 


_ Hof (x) + aa f(x +h) + oof (x + 2h) 
0) SS 


the Taylor expansions are given by 


2 
fet h) = fx) +hf(e) + SP") +00), 
flor +2h) = fx) +2 f(x) + 2 F"(x) +O), 


The following conditions are obtained: 


f(x): o + 424 2 2=4 
eae a + 2w = 1, 
Na pas hy, + 2hag = 0. 
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Solving this system yields aj = —3/2,a, = 2 and a) = —1/2, such that the one-sided difference 
formula equals 


—3f (x) + 4f (x +h) ~ f(x + 2h) (3.6) 


Q(ht) = = 


with truncation error O(h?). 


3.5 Relation between difference formulae and interpolation * 


In Section 3.4, a general way to derive difference formulae has been discussed. A clear disad- 
vantage, however, is that a set of equations has to be solved, where the number of unknowns 
increases with the order of accuracy. 


In this section, a different approach is considered using Lagrange interpolation polynomials to 
approximate functions (Section 2.3). It seems rather obvious that the derivative of the interpola- 
tion polynomial may be taken as an approximation of the derivative of the function. This idea is 
pursued to derive difference formulae in a different way. 


If x0, X1,---,Xn do not coincide and f € C"*1[a,b], then the Lagrange interpolation formulation 
is given by 


_ f OG) 
FOS YX F(R) Een) + (%— x9) ++ ie ae 


in which ¢(x) € (a,b). Differentiation of this expression yields 


Nx) = ¥ / d OLN) 
Fila) = YE Fae) Len 2) + Fy (= 20) 0) 
(n+1) x 
een ee ee cee CD (3.7) 


dx (n+1)! 


The last term generally cannot be evaluated. However, if x = Xj, then equation (3.7) becomes 


bay = floc tagy 4 Pag ang Oe 
Fa) = Ye fbi) + T1G5 — 0 Ga 
kj 


Example 3.5.1 (Forward difference) 


For n = 1,x = xo and x1 = xo +h, the Lagrange basis polynomials satisfy 


- 1 
In(x) = gob, Lyx) = --, 
Ly (x) => art 4 Li (x) => i: 


This means that the derivative is given by 


(x0) =~ FF 20) + flo +H) + SF" @), 


which is exactly the forward-difference formula. 
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3.6 Difference formulae of higher-order derivatives 


In the introduction it has been explained that second-order derivatives are required in order to 
compute the acceleration from measured position values. 


Higher-order derivatives may be approximated using a similar approach as for the first deriva- 
tive. Suppose that f”’ (x) should be approximated with a central-difference formula Q(h) . Choos- 
ing x_y =x —h,xo = x and x; = x +h, the approximation is computed as 


a_if(x—h) +aof(x) +arf(x +h) 


Q(h) = 5 ; 


where the division by h? has been motivated in Rule of thumb 1. Taylor expansion about x yields 
'y y Pp y 
! h 1 he aw 4 
f(x h) = fx) thf) + sf) + af) + Oh"), 
f(x) = f(x), 
/ h? 1 he wy 4 
f(x —h) = f(x) hf (x) + Sf (x) — aff (x) + OF"). 


The conditions for a_1, 9 and «1 are 


f(x) oS +8+ ¢# = 4G 
fi) + = 0, 
limes e 5u_4 + ral = iy 
Solving this system yields w_; = 1,4) = —2 and a; = 1, such that 
Rhy 2p) Flea 
(iy = Hern = fla) + eB) ex 


Note that the error in this formula is O(h?). 


Rule of thumb 2 
To derive a numerical method for the kth derivative with truncation error O(h?), the Taylor 
expansions should have a remainder term of order O(h? rh), 


Another way to determine the second derivative is the repeated application of the formula of 
the first derivative. To approximate the second derivative numerically, at least three points are 
required. Suppose that the second derivative is approximated using three equidistant function 
values, f(x —h), f(x),and f(x +h). 


The obvious way to approximate the second derivative is to take the difference of the first deriva- 
tive at the points x + 1 and x — h. Using central differences, this means that the function values 
in x + 2h and x — 2h are required. This can be avoided by introducing auxiliary points x + h/2 
(Figure 3.3) and taking a central difference of the derivatives at these points: 


wy fet gh) = f(x gh) 
; | 


f(x) 
Applying a central-difference scheme again to the first-order derivatives yields 


Heya 1 (LEtO-f@) _ f@)-fe-» 
rea, (Ae ae), 


which can be simplified to 
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~ 
N= 


x | x x 
x—h x x+h 
Figure 3.3: Nodes to approximate f”’(x) using a repeated central-difference formula. 


The truncation error is obtained using Taylor polynomials, and equals 


f"(x) — Qh) = - pf). (3.9) 


Note that the effect of rounding errors is even more serious here. An upper bound for the error 
caused by rounding errors is: 


de 
S(h) < re 


To solve a certain type of differential equation it may be necessary to approximate (vf’)’, in 
which 7 is a (given) function. Analogous to the above, the following approximation of (vf’)/ can 
be used: 
(of!)(x+ 5h) — (of')(x — 5h) 
; ; 


Applying central differences once more yields 


o(x + gh) (f(x +h) — f(x) - (a = ah) F(@) — fe) 


3.7 Richardson’s extrapolation 


3.7.1 Introduction 


In the previous sections, some truncation error estimates for various difference formulae were 
presented. These expressions often contain a higher-order derivative of the function. However, 
if the first derivative is unknown, it is impossible to determine a higher-order derivative. In 
short, these estimates are often of theoretical importance, but useless in practice. In Section 3.7.2, 
it will be explained how Richardson’s extrapolation can be used as a practical method to make 
a numerical estimate for the error. Furthermore, Section 3.7.3 explains the use of Richardson’s 
extrapolation to obtain a higher accuracy with a low-order method in case the order of the error, 
p, is known. 


In Richardson’s extrapolation, a numerical method Q(h) is used to approximate an unknown 
value M, where we assume that the error in this approximation has the form 


M — Q(h) = cph? + O(hP*1), (3.10) 


in which cy # 0, p € IN. Note that the error has this form if it can be expressed as a Taylor series 
and that p is the order of the error. 


In this section, Richardson’s extrapolation will be applied to difference formulae, but the method 
can also be used for other types of approximations (interpolation, numerical integration etc.), as 
long as the error has the form (3.10). 
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Table 3.3: Approximation of f’(1) using f(x) = e*, and forward difference. 


Approximation 


4h = 0.1 Q(4h) = 2.8588... 


2h = 0.05 | Q(2h) = 2.7873... 
h = 0.025 | Q(h) = 2.7525... 


3.7.2 Practical error estimate 


In this section, Richardson’s extrapolation will be used to make a numerical estimate of the error 
M— Q(h). 
For h sufficiently small, equation (3.10) can be approximated by 


M — Q(h) = eph?. (3.11) 


For a given h, Q(h) can be computed, but M, Cp and p are still unknown. However, these val- 
ues can be approximated by computing for example Q(h), Q(2h) and Q(4h). In that way, the 
following set of equations is obtained: 


M — Q(4h) = cp(4h)?, (3.12a) 
M — Q(2h) = cp(2h)?, (3.12b) 
M — Q(h) = cp(h)P (3.12c) 


By subtracting equation (3.12b) from (3.12a) and equation (3.12c) from (3.12b), M can be elimi- 
nated and only two unknowns are left: 


Q(2h) — Q(4h) = cp(2h)P(2? — 1), (3.13a) 
Q(h) — Q(2h) = cp(h)P (2? — 1). (3.13b) 
Next, cp and h can be eliminated by dividing these two expressions, which yields 
Q(2h) — Q(4h) 
Q(h) — Q(2h) 


from which p may be determined. Substitution of p into (3.13b) provides an approximation for 
Cyh?, which is used in equation (3.12c) to find the error estimate M — Q(h). 


= 2P, (3.14) 


Example 3.7.1 (Practical error estimate) 


Once again, the numerical approximation of the derivative of f(x) = e* is approximated in 
x = 1, here using a forward-difference formula. The exact value of the derivative is known: 
M = f'(1) =e = 2.71828.... Using h = 0.025, the values Q(h), Q(2h), and Q(4h) are tabulated 
in Table 3.3. 


Substituting these results into (3.14) yields 


Q(2h) — Q(4h) _ —0.0714... _ 7 
O=00n, Sno, eee 


As expected the value of p is almost equal to 1. Because p (the order of the error) should be an 
integer, we take it equal to 1. From equation (3.13b), the following error estimate is obtained: 


Q(h) — Q(2h) 


= = P — 
M — Q(h) = cph ro 


= —0.0348.... 
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Note that the exact error equals 
ME OG) = 0227525 002 ie 


In this example the error estimate is very reliable. 


To receive a better approximation the error estimate can be added to the approximation: 


Ol) eph = 2.7525. 2 00948 oa = 27177 on 


In the above example, the value of p was computed using Richardson’s extrapolation. However, 
using Theorem 3.2.1, it is clear that p = 1, and this value could have been used immediately in 
equation (3.13b) in order to determine cyh?. In practice, more complex situations are found, and 
the following complications may occur: 


- It is not known whether higher-order derivatives exist and/or are bounded. 


- The final result is a combination of various approximation methods. The influence of these 
approximations on p is not always clear. 


- During implementation of the algorithm in a computer program, errors may be made. 


To reveal any of these complications it is good practice to verify whether the calculated p is close 
to the p that follows from theory. 


3.7.3 Formulae of higher accuracy from Richardson’s extrapolation * 


In several applications the value of p in (3.10) is known. In that case Richardson’s extrapolation 
can be used to determine formulae of higher accuracy. 


This is done by making use of the fact that the error estimates for Q(h) and Q(2h) equal 


M—Q(h) =cph? + O(h?P*1), (3.15a) 
M — Q(2h) = cp(2h)P + O(h?*1) . (3.15b) 
Multiplying equation (3.15a) by 2? and subtracting equation (3.15b) from this yields 
2P(M — Q(h)) — (M — Q(2h)) = 2? (cph?) — ep (2h)? + O(KP*), 
such that 
(2? — 1)M — 2PQ(h) + Q(2h) = O(hP*). 


This means that 

2PQ(h) — Q(2h) 
2P—1 

The value (2?Q(h) — Q(2h))/(2? — 1) is a new approximation formula for M with an accuracy 

that is one order higher than the order of Q(h). 


M= + O(nPt), (3.16) 


Example 3.7.2 (Forward difference of higher accuracy) 


As an example, the forward-difference method is considered. The error in the forward-difference 
formula may be written as 


f'(x) — Qf(h) = yh + O(N’), (3.17) 
and the difference for 2h equals 


f(x) — Q¢(2h) = 2h + O(N’). (3.18) 
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Multiplying equation (3.17) by 2 and subtracting (3.18) from this result yields 


f' (x) — (2Q¢(h) — Qf(2h)) = O(h”). 


Hence the difference formula 


2f(x+h)—2f(x) f(x+2h)— f(x) 3f (x) +4f(x +h) — f(x + 2h) 


20) Oh) = 7 Th 


has a truncation error of order O(h*). This formula is equal to (3.6). 


3.8 


Summary 


In this chapter, the following subjects have been discussed: 


3.9 


Difference methods and truncation errors for the first derivative; 

The effect of measurement errors; 

Difference methods and truncation errors for higher-order derivatives; 
Richardson’s extrapolation: 


- Practical error estimate; 
- Increasing accuracy. 


Exercises 


. Prove for f € C3[x —h,x +h], that the truncation error in the approximation of f’(x) with 


central differences is of order O(h?). 


. Assume that the position of a ship can be determined with a measurement error of at most 


10 meters, and that the real position of the ship during start up is given by the function 
S(t) = 0.5at*, in which S is expressed in meters and ft in seconds. The velocity is ap- 
proximated by a backward difference with step size h. Give the truncation error and the 
measurement error in this formula. Take a = 0.004. Determine h such that the error in the 
calculated velocity is minimal. How large is the error? 


. In this exercise, the central-difference formula 


x x) +f(x-4 7 
to) = Fenn Af) Her) _F Hae, @ € (x—h,x +h) 


is used to approximate the second derivative of f(x) = sinx in x = 1. The computer, and 


hence Matlab, uses finite precision arithmetic, and for this reason it computes perturbed 
function values f(x). The rounding error in the function value is bounded by 


na 


Lf(x) — F(x)| 
If(x)| 


in which eps is the relative machine precision (assume that eps = 10~!®). 


< eps, 


(a) Give an upper bound for the total error (truncation error + rounding error) and mini- 
mize this upper bound to find an optimal value for h. 


(b) Take h = 1. Compute the true error in the approximation for f”’(1). Repeat this eight 
times and divide h by 10 after every computation. Tabulate the results. Does the h for 
which the error is minimal agree with the value found in question (a)? Remark: the 
computations for this question have to be carried out using Matlab. 


Chapter 3. Numerical differentiation 37 


4. Let the function f(x) = sinx,x € [0,71] be given. Compute f’(1) by a central difference 
with h = 0.1. Estimate the error by Richardson’s extrapolation. 


5. Given are f(x), f(x+h) and f(x + 2h). Derive a formula of maximal order to approximate 
f(x). 
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Chapter 4 


Nonlinear equations 


4.1 Introduction 


The pressure drop in a fluid in motion is examined. For a flow in a pipe with a circular cross 
section of diameter D (meter), the Reynolds number, Re, is given by 

D 

Re= =, 
v 

in which 9 (11/s) is the average flow velocity and v (m?/s) is the viscosity of the fluid. The flow is 
called laminar if Re < 2100 (low flow velocity) and turbulent if Re > 3000. For 2100 < Re < 3000, 
the flow is neither laminar nor turbulent. 


For turbulent flows, the pressure drop between inflow and outflow is given by 


wLv 
Pout — Pin = aa 
in which w isa friction coefficient, 0 (kg/m?) is the fluid density, L (m) is the length and ¢ (m/s?) 
is the acceleration of gravity. If the fluid contains particles (sand, paper fibers), then the friction 
coefficient w satisfies the equation 


5.6 
1 _ In(ReVw) + 14-28 


Jo kk 
in which k is a parameter known from experiments. 


In this chapter, numerical methods will be discussed that can be used to determine w if the values 
of Re and k are known. 


4.2 Definitions 


In this chapter, various iterative methods will be considered to solve nonlinear equations of the 
form f(p) = 0. The point p is called a zero of the function f, or a root of the equation f(x) = 0. 
First, some useful definitions and concepts are introduced. 


Convergence 

Each numerical method generates a sequence {pn} = po, P1, P2,--. Which should converge to p: 
limn—oo Pn = p. Assume that the sequence indeed converges, with pn ¥ p for all n. If there exist 
positive constants A and « satisfying 


im [P= Patil _ x (4.1) 


ne |B Pall 
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then {p,} converges to p with order « and asymptotic constant A. 


In general a higher-order method converges faster than a lower-order method. The value of the 
asymptotic constant, A, is less important. There are two important cases: 


- « = 1: the process is linearly convergent. In this case A is called the asymptotic convergence 
factor. 


- « = 2: the process is quadratically convergent. 


Theorem 4.2.1 Suppose that a sequence {pn} = po, P1,P2,--. satisfies |p — pn| < k|p — pn—il, for 
n=1,2,...,whereO <k <1. Then limyn_soo pn = p: the sequence is convergent. 


Proof: 
By induction, 
Tim |p — pn| < lim k"|p — pol = 0, 


n—-0co 
because k < 1. Hence py converges to p. X 
Stopping criteria 
Because an iterative method will be used to approximate p, it is necessary to specify when the 
method should stop. Some examples of stopping criteria are listed below. 


1. If the solution, p, is known, then |p — pn| < € is the most useful stopping criterion. Since p 
is not known in general, this criterion cannot be used in practice. 


2. Two successive approximations can be used in the stopping criterion | pn — py_i| < €. How- 
ever, for some methods, it is possible that |py — py—1| < €, whereas |p — pn| > e. In that 
case the method stops at a wrong approximation of the solution. 


3. Since p may be very large or very small, it is often more meaningful to consider the relative 

error 
Pines Prt) ig Be ae 
[pn 

4. Finally, the accuracy of py can be determined by the criterion |f(pn)| < e. If this cri- 
terion is satisfied, and the first derivative of f exists and is continuous in the neighbor- 
hood of p, we can bound the error of the approximation. Using the intermediate-value 
theorem, we know that |f(p) — f(pn)| = [f’()|- |p — pul for some ¢ between p and 
Pn. Since f(p) = 0 and |f(pn)| < ¢€, we find that |f’(¢)|-|p— pal < ¢. By defining 
m = min{|f’(¢)|,¢ between p and p,} it follows that 


Ip— pal <=, if m#0. (4.2) 


Uncertainty interval 

In numerical simulations, the exact function values, f(x), are often unknown. Instead, an ap- 
proximation of the function values is used, which is denoted by f(x), in which | f(x) — f(x)| < @ 
Using the perturbed function, f , the zero of f cannot be determined exactly, as each point of the 
set I = {x € |a,b] | |f(x)| < €} can bea solution. 


The width of this uncertainty interval can be computed using that f(p) = 0 (p is the zero of f), 
and assuming that x* is the zero of the maximally perturbed function f(x) = f(x) +, such that 


fixt) =0. 


Linearization of f(x*) about p yields 


O= f(xt) = f(xt) tex f(p) + (x7 — p)f'(p) +E = (x7 — p)f'(p) +& 
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This means that f(x+) = 0 for : 
xt wy p- a 


fp): 


Similarly, the zero of the function f(x) = f(x) — Z is given by 
xR p+ae 


The values x~ and x* form the bounds of interval I. Because f’(p) can be either positive or 
negative, the general notation for the uncertainty interval is 
é é 
PTF PT TF@N 
if f’(p) # 0. If | f’(p)| is close to 0, then finding the root of the equation f(p) = 0 is an ill-posed 
problem. 


If e < é, then the stopping criterion | f(pn)| < € is useless, because it is possible that the algorithm 
continues, although pn € I. 


4.3 A-simple root finder: the Bisection method 


The first method, the Bisection method, is based on the intermediate-value theorem. Suppose 
f is a continuous function defined on an interval [a,b], where f(a) and f(b) have opposite sign 
(which means that f(a) - f(b) < 0). Then, according to the intermediate-value theorem, there is 
a number p in (a,b) with f(p) = 0. The Bisection method divides [a,b] into two equal parts and 
continues with the interval that contains p. 


To begin with, let aj = a, bo = b. In general, a new approximation for the zero is computed by 


Ayn + by 
Pn = ete 
If the chosen stopping criterion is satisfied, then the zero of f is found. If not, then a new interval 
[4n+1, by +1] is constructed using 


An41 = 4n and by41 = pnif f(an)- f(pn) <0, 


and 
Ayn41 = pnand by41 = by otherwise. 


Convergence 

Note that this method always converges to a solution: limy— oo pn = p. In general, it converges 
slowly, and it may happen that |p — py—1| < |p — pn|. The unconditional convergence makes 
the Bisection method a useful method for computing a starting estimate for the more efficient 
methods that will be discussed later in this chapter. 


Theorem 4.3.1 Assume f € C[a,b| and f(a) - f(b) < 0. Then the Bisection method generates a sequence 
{pn} converging to a zero p of f in which 


a 
IP — Pal S spre n= 0. 


Proof: 
For each n > 0, by — an = (b— a)/2" and p € (an, bn). Since pn = (an + bn) /2 it follows that 


bn — an b-—a 
PPA Se = a 
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4.4 Fixed-point iteration (Picard iteration) 


A different way to compute the zero of a function, is the use of a fixed-point method. A fixed point 
of a function g is anumber p such that g(p) = p. In this section, a numerical method is presented 
to approximate p. 


The relation between zeros and fixed points can be explained using a function f that has a zero 
p. It is always possible to define a function g, for example, g(x) = x — f(x), such that the fixed 
point of g is the zero of f. Note that g is not unique, for example, the fixed point of the function 
g(x) =x—cf(x),c € Vis also a zero of f. Conversely, if g(p) = p, then f(x) = x — g(x) vanishes 
in p. 

In the next theorem, sufficient conditions are provided for the existence and uniqueness of a fixed 
point. 


Theorem 4.4.1 Let g € C{a,b] and g(x) € [a,b] for all x € [a,b]. 
1. Then, g has a fixed point in |a, b] (Brouwer’s fixed-point theorem). 
2. If, in addition, g’ (x) exists for x € [a,b] and if there exists a positive constant k < 1 such that 
|e’(x)| <k forall x € [a,b], 
then the fixed point in [a,b] is unique (Banach’s fixed-point theorem). 
Proof: 


1. If g(a) = a or g(b) = b, then g has a fixed point at one of the end points of the interval. If 
this is not the case, then g(a) > a and g(b) < b since g(x) € [a,b]. Let the function h be 
defined as h(x) = g(x) — x. This function is continuous on [a,b], and furthermore, h(a) > 0 
and h(b) < 0. From the intermediate-value theorem it follows that there is a p such that 
h(p) = 0, hence p is a fixed point of g. 


2. Suppose |g’ (x)| < k < 1 and assume that there are two fixed points p and q with p < q. By 
the mean-value theorem there exists a € € [p,q] such that 


— 8) 
8(P) = 8) _ ge), 
| 
From this it follows that 
Ip — 4] = |s(e) — s(@)| = Is") Il — 41 < lp — al, 

which is a contradiction. Hence, p = q and the fixed point is unique. xX 
To approximate a fixed point of a function g, a starting value po has to be chosen. For n > 1, pn 
is determined by pn = g(pn—1). If the sequence converges to p and g is continuous, then 

p= lim pn = lim g(pn—1) = g( lim pn—1) = 8(p), 
hence p is a fixed point of g. This method is called fixed-point (or Picard) iteration. 
Remarks 
e It is easily seen that each converging fixed-point method is at least linearly convergent. 


e Close to the solution p, the error decreases as |p — py41| © |¢'(p)||p — pul If |9’(p)| is small, 
convergence will be fast, but if |¢’(p)| is only marginally smaller than 1, convergence will 
be slow. If |¢’(p)| > 1 there is no convergence. 


Note: if g’/(p) 4 0, then the process is linearly convergent, with asymptotic convergence 
factor |9'(p)|. If. g’(p) = 0, then the process is higher-order convergent. 


Chapter 4. Nonlinear equations 43 


¢ Note that k can be chosen as k = maxye{a,p] |’ (*)|- 


e If the fixed-point iteration pn = g(Py—1) satisfies the conditions of Banach’s fixed-point 
theorem (Theorem 4.4.1) and po and p are located in [a,b], then another stopping criterion 
can be formulated. Note that for m > n > 0, 


|Pm — Pn| =|Pm Pm—-1 + Pm-1 — Pm+2 +--+. + Putt Pn 
< |pm — Pm—1| + |Pm—-1 — Pm—2| +--- + |Pn+1 — Pal- 


Using the mean-value theorem and the fact that |¢’(x)| < k for all x € [a,b], we know that 


[Pnti — Pal = |g(Pn) — 8(Pn—1)| < klpn — Pn-al- 
Applying this recursively, we find that 


[Pm — Pnl Sk" |pn — Pail +--+ kl pn — Pu-il 
= (k+... +k") |pa— pail, 


and, since lim pm = p it follows that 
m— oo 
|p — pn| = lim |p — pal Sk Kp Spee pee ca (4.3) 
n snes m n| > ra n n—1 1_k n n—1]|- ¢ 
This means that stopping criterion 


[Pn — Pn-1l < € 


implies that |p — pn| < e. 
Example 4.4.1 (Picard iteration) 


To determine the real zero of the function f(x) = x° + 3x — 4 (which is p = 1), the function 


4 
8) = 243 


can be used in the iterative process. This function is determined by the following steps: 


=06 p?+3p—4=06 p+3p=46 pip? +3) =46 p= 
p p+ 3p p 


4 
pg — iP): 
Using pn = 8(Pn—1) and starting with po = 0, we find the values as in the second column of 
Table 4.1 (rounded to four decimals). Note that ¢/(x) = —8x/(x? +3)*, and hence it can be 
verified that |¢’(x)| < 1/2 = k. Using the third and fourth column of Table 4.4.1, the validity 
of equation (4.3) is shown: |p — pu| < |pn — Pn—1|. Finally, the fifth column shows that indeed, 
lp — Pnsil/|p — pnl| © |e/(p)| = 1/2 (remark after Theorem 4.4.1). 

Figure 4.1 illustrates the iterative process. The line y = x is used for the transformation of 
Y = 8(Pn-1) tox = pn. 


Theorem 4.4.2 Suppose g € C[a,b], g(x) € [a,b], for x € [a,b], and |g"(x)| <k < 1forx € [a,b]. 
Then the fixed-point iteration converges to p for each value po € [a,b]. 


Proof: 
Under the provided conditions, g has a unique fixed point p. From the mean-value theorem it 
follows that 


lp — pl = |g(p) — 8(Pn-1)| = |8/ (OP — Pn—-il < klp — pn—1|, where ¢ is between p and p,-1. 


From Theorem 4.2.1 it follows that pn converges to p. Xl 
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Table 4.1: Picard iteration pn = g(pPn—1) corresponding to example 4.4.1. 


lp — Pal/|P — Pn-il 


0.3333 
0.4884 
0.4964 


Po p2 P3 Pi 


Figure 4.1: Fixed-point method, used to solve p = 4/(p* +3) = g(p). The iteration process is 
marked by a bold line. 


4.5 The Newton-Raphson method 


The Newton-Raphson method is one of the most powerful and best-known numerical methods 
to solve a nonlinear equation f(x) = 0. The method will be explained using a Taylor polynomial. 


Suppose f € C?[a,b]. Let  € [a,b] be an approximation of the root p such that f’(%) 4 0, and 
suppose that |p — x| is small. Consider the first-degree Taylor polynomial about *: 


x — x)? 
fle) = fa) + (x9 @)+ SZ prey, (4.4) 


Because | p — £| is small, (p — ¥)? can be neglected, such that 


Ow f(x) + (p—¥)f'(%). 


Note that the right-hand side is the formula for the tangent in (*, f (*)). Solving for p yields 
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This motivates the Newton-Raphson method, that starts with an approximation po and generates 
a sequence {pn} by 
hi ( Pn-1 ) 


= Py_-1 - =, for n>. 
ROSEN Gre) 


The graphical illustration is that pn is the zero of the tangent in (py—1, f (Pn—1)), which is depicted 
in Figure 4.2. 
Example 4.5.1 (Newton-Raphson method) 


In this example, the positive zero of the function f(x) = x* — 2 will be determined using the 
Newton-Raphson method. To six decimal places the exact result is p = 2 = 1.41421. Note that 
f'(x) = 2x, such that the method generates approximations using 


(eaie ie 
2Pn—-1 
Starting with po = 1.00000..., the values p; = 1.50000..., py = 1.41666... and p3 = 1.41421... 


are calculated. This means that after only three steps, the solution is found to six decimal places. 
The procedure for the first iteration is explained graphically in Figure 4.2. 


Pn = Pn-1 — 


2 1 1 
— f(x) =x? 2 
1.5), - - = tangent in (1, —1) 1 


1 


0.5 


Figure 4.2: The Newton-Raphson method used to solve the positive zero of f(x) = x* — 2. The 
zero of the tangent in (po, f(po)) is pr. 


In the next theorem the Newton-Raphson iteration is considered as a fixed-point method. 


Theorem 4.5.1 Let f € C?{a,b]. If p € [a,b] such that f(p) = O and f'(p) # 0, then there exists 
ad > 0, such that the Newton-Raphson method generates a sequence {py} converging to p for each 


po € [p—6, p+]. 


Proof: 
In the proof, the Newton-Raphson method is considered as a fixed-point method pn = g(Pn—1) 


with f(x) 
g(x) =a f(x)’ 


In order to use Theorem 4.4.2, it is necessary to show that there exists a 6 > 0, such that g satisfies 
the conditions of the theorem for x € [p—6,p +]. 


46 Numerical Methods for Ordinary Differential Equations 


e Continuity of ¢ 


Since f’(p) 4 0 and f’ is continuous, there exists a 6; > 0 such that f’(x) 4 0 for all 
x € [p— 061, p+0,]. Therefore, g is well defined and continuous for all x in [p — 61, p +4]. 


e Derivative of g 
The derivative of g is given by 


(x) <1 LEP = Fdf"(w) _ FF"(w), 


(f!(x))? (f"(x))? 


Since f € C?[a,b] it follows that g(x) is continuous: ¢(x) € C![p— 61, p+6,]. Note that 
g'(p) = 0 because f(p) = 0. Since g’ is continuous, there exists a 6 < 5; such that 


|e'(x)| <k<1 forall x €[p—6, p+]. 


e Domain and range of ¢ 
Finally, we show that g(x) € [p—06,p +0] if x € [p—6,p+0]. Using the mean-value 


theorem, 


Is(p) — g(x)| =l3’(2) lp —x| for some ¢ between x and p. 
Since x € [p — 6, p +] it follows that |p — x| < 6 and |9’(é)| < 1, and hence, 


Ig(p) — a(x)| < |p—2] <6. 


Using Theorem 4.4.2, the sequence {p,} converges to p for each po € [p — 6, p + 6]. This proves 
the theorem. & 


Quadratic convergence 
The convergence behavior of the Newton-Raphson method can be computed by making use of 
the following observation: 


= ee 
0 = f(p) = (pn) + (p—Pndf'(pn) + LP p"(g) for Gn between pn and p. (45a) 
The Newton-Raphson method is defined such that 


0 = f(pn) + (Pn4i — Pn) f (pn). (4.5b) 


Subtracting expression (4.5b) from (4.5a) yields 


= 2 
(P— Prss)f'(Pn) + PP p,) =0, 


such that 


[p= pnsil _ | f" (En) 
Ip—pnl? | 2F"(pn) 


From equation (4.1) it follows that the Newton-Raphson method converges quadratically, with 


a =2and 
FG ftp) 
2f'(Pn) 2f'(p) 


A = lim 
n—- oo 
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4.5.1 Variants of the Newton-Raphson method 


In the Newton-Raphson method, the derivative f’(p;—1) should be computed. In order to cir- 
cumvent this, several variations on the standard Newton-Raphson method will be defined. They 
have the common form 

f (p ia) 


Pn = Pn-1 — 7 et (4.6) 
where K is an approximation of f’(p;,_1). 


Quasi-Newton method 
The Quasi-Newton method takes 


ae — fn +) ~ fPna) 
h y 
where h > 0. 


Secant method 
The Secant method replaces f’(pj—1) by 


K = £Pn-1) = f(Pn-2) 
Pn—-1— Pn-2 


This method is derived in exercise 3. 


Regula-Falsi method 

The Regula-Falsi method combines the Newton-Raphson method with the Bisection algorithm. 
Suppose that the interval [a,,_1,b,—1] contains a zero of f, such that f(a,_1) - f(by;—-1) < 0. Then, 
approximation p,, is computed as follows: 


Bn—1 — @n—1 


Pa = nt — fQn—1) FE Fa, a) 


If the chosen stopping criterion is satisfied, then the zero of f is found. If not, then, similar to the 
Bisection approach, the new interval [an, by| is constructed by defining 


Ay = Ay, and by = pn if f (@n—1) ‘ f (Pn) <0, 


and 
Ayn = pn and by = b,_, otherwise. 


4.6 Systems of nonlinear equations 


In this section, the theory will be extended to systems of nonlinear equations, which can be writ- 
ten either in fixed-point form 


S1(P1,-++,Pm) = Py 
go(p1,- * Pm) = P2, 


(4.7a) 
Sm(P1r- ++) Pm) = Pm, 
or in the general form 
ia — = 0, 
2\P1sreees = 0, 
a (4.7b) 
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In matrix-vector notation, these systems can be written as g(p) = p and f(p) = 0, respectively, 
where 


P= (P1r--+, Pm)! 
8(P) = (g1(P1r-+-1 Pm), -++18m(Pry-++,Pm))", 
f(p) = (fi(pr-++1 Pm), +++ fn (Pr +++) Pm)" 
As in the scalar case, an initial estimate for the solution is used to construct a sequence of suc- 
cessive approximations, {p™}, of the solution, until a desired tolerance is reached. A possible 


stopping criterion is ||p() — p("-))|| < e, where ||x|| = \/x7 +...+.x2, denotes the Euclidean 


norm. 


y 


4.6.1 Fixed-point iteration (Picard iteration) 


Similar to the scalar case, the fixed-point iteration approximates the solution to system (4.7a), 
using 

p”) = g(p")). (4.8) 
To illustrate the use of the fixed-point method, the following example is given. 
Example 4.6.1 (Picard iteration) 


In this example, the following system of equations is solved: 


2p1— pot+PT = 4, 
(4.9) 
—pit2po.+ ps =F. 
It is easy to see that the solution equals (p1, p2)' = (1/3,2/3) !. The Picard iteration uses 
=A 
apy — py + py” = 5, 
=] 
— py) +273" + py ps” = FB, 
which can be written as 
A(p'"-))p™ = bo p™ = AN(plD)p, (4.10) 


where 


(n— 1 

= 2+ —1 = 
A(p(@-)) = Py ae 9 : 

ew ( ae ae e 


If the initial estimate equals p) = (0,0) ', then the new approximation equals p“) = (5/9,1)'. 
Note that system (4.9) can be written as g(p) = p, with g(p) = A(p)~'b. 


4.6.2 The Newton-Raphson method 


A faster iterative method to approximate the solution to system (4.7b) is the Newton-Raphson 
method. Similar to the scalar case of the Newton-Raphson method, the successive approximation 


is found by linearizing function f about iterate pi): 


Oe | ae os Set Be 
File) © fle?) + (WO) (a — ph”) ++ SB) (Pmn — BH), 


n— Ofin n— n— Ofm n— n— 
Jn (P) © fn D+ ep! D)(p, — p! Dot D) (pm — ptt). 
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Defining the Jacobian matrix of f(x) by 


5A (x) as oh (x) 
sf (x) ‘itt on (x) 


the linearization can be written in the more compact form 


f(p) = £(p""") + (pp) (p— pp). 


The next iterate, p‘"), is obtained by setting the linearization equal to zero: 


f(p"—)) + (pO) (pM — p&Y) =0, (4.11) 


which can be rewritten as 


I(p"-Y)s) = —£(p@-Y), (4.12) 


("—1). The new approximation equals p'") = p\"—~)) + 5”). 


where s") = p(") — p 
A fast way to compute p'”) uses equation (4.11): 


pera pr ray pv yep”): (4.13) 


If the computation of the partial derivatives in the Jacobian matrix is not possible, then the deriva- 
tives can be approximated using, e.g., forward divided differences: 


a x fies ea F1) = = Fi) ; 


where e; is the ith unit vector. This is the Quasi-Newton method for systems. It is also possible to 
use central differences to approximate the derivatives. 


(4.14) 


Example 4.6.2 (Newton-Raphson method) 
In this example, the Newton-Raphson scheme is applied to the following system for p, and py: 
18p1— 9p2+ pp =0, 


(4.15) 
—9p1 + 18p2 + ps = 9, 


with initial estimate p = (0,0) '. 
System (4.15) can be written as f(p) = 0, where 
fi(Pi,p2) = 18p1— 9p2 + pt, 
fo(p1, P2) = —9p1 + 18p2 + p3 — 9. 
The Jacobian of f(x) is given by 


_ (18+2%, 9 
F(x) = ( 9 ice) 


Defining s“) = p{ — p(), equation (4.12) equals 


(p)s") = -#(p), 
such that the following system has to be solved: 


18s; — 9s§ =0, 
—9s()) + 185) — 9, 
The solution equals s‘!) = (1/3,2/3)', such that p™ = pO +s = (1/3,2/3)". 
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4.7 
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Summary 


In this chapter, the following subjects have been discussed: 


4.8 


ol 


Stopping criteria; 

Rounding errors; 

Convergence; 

Bisection method; 

Fixed-point iteration (Picard iteration); 

Newton-Raphson method; 

Quasi-Newton method, Secant method, Regula-Falsi method; 

Systems of nonlinear equations: Picard iteration, Newton-Raphson method. 


Exercises 


. Let f(x) = 3(x+1)(x —1/2)(x —1). Use the Bisection method to determine p2 on the 


intervals [—2,1.5] and [—1.25, 2.5]. 


. In this exercise, two different fixed-point methods are considered: 


20pn-1+21/p2_, pe asel 
| and ae soa a 1 


Show that for both methods, (21)!/° is the fixed point. Estimate the convergence speeds, 
and determine p3 if po = 1. 


. In this exercise, the Secant method is derived to approximate a zero of f. 


(a) Suppose that po and p; are two given values. Determine the linear interpolation poly- 
nomial of f between po and py. 

(b) Compute p2 as the intersection point of this interpolation polynomial with the x-axis. 
Repeat this process with p; and p2 to obtain p3, and derive a general method for com- 
puting py from py—2 and p,_; (Secant method). 

(c) Perform two iterations with this method using the function f(x) = x* — 2 with pp = 1 
and p; = 2. 


Consider the function f(x) = x —cosx,x € [0,7/2]. Determine an approximation of the 
zero of f with an error less than 10~4 using the method of Newton-Raphson. 


. Perform two iterations with the Newton-Raphson method with starting vector (1,1)! to 


solve the nonlinear system 
ay —X7 -3=0, 
— x1 +25 +1=0. 


Compare the approximation with the solution (2,1). 


Chapter 5 


Numerical integration 


5.1 Introduction 


Determining the physical quantities of a system (for example volume, mass, or length) often 
involves the integral of a function. In most cases, an analytic evaluation of the integral is not 
possible. In such cases one has resort to numerical quadratures. 


As an example, we investigate the production of a spoiler that is mounted onto the cabin of 
a truck (Figure 5.1). The shape of the spoiler is described by a sine function with a 277 meter 
period. The aluminium spoiler is made out of a flat plate by extrusion. The manufacturer wants 
to know the width of the plate such that the horizontal dimension of the spoiler will be 80 cm. 
The answer to this question is provided by the arc length of the curve 


x(t)=t, 
y(t)=sint, 0<t<08. 


Numerical 
Integration 


Height 


O OO 


Length 


Figure 5.1: Truck with spoiler on the cabin. 
To determine the arc length the formula 


0.8 ; 0.8 
= dy = 2 
=) 1+ (2) a fale dt 


is used. However, this integral cannot be evaluated in a simple way. In this chapter, numerical 
integration methods will be investigated in order to determine the arc length. 
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5.2 Riemann sums 


To begin this chapter, the definition of an integral is recalled. 


Definition 5.2.1 (Integral) A partition P,, of [a,b] is a finite number of distinct points x, such that 
a= x <x <-++: < xX, = b. Next, Ty is defined as a set of intermediate points t, such that 
Xe-1 << te < xp. The length of an interval is denoted by hk = xg — Xp_1 and the mesh width by 
m(Pn) = maxy<p<n {hg}. The Riemann sum for a function f € Cla, b| is defined as 


R(F, ParTn) = Yo Iaf (tt): 
k=1 


Let a sequence of partitions, P,, P2,..., and corresponding T,, T2,..., be given, such that lim m(Pn) = 0. 
n [o-e) 


We call f Riemann integrable over [a,b] if 
b 
R(f, Pn, Tn) converges toalimit T= | Fedde. 
a 


Riemann sums are usually used to study integrability theoretically, but they are not very useful 
in practice. Numerical integration rules have a similar structure to Riemann sums: 
n 
T= 1 wef (te)- 


k=0 


Here, t, are called the integration points, and w; are the corresponding weights. Numerical inte- 
gration rules are also called quadrature rules. 


5.3 Simple integration rules 


In this section, the integration of a function f over a single interval [x1, xp] will be considered. 
Several simple integration rules will be presented, together with their approximation errors. The 
effect of rounding errors will also be considered. 


5.3.1 Rectangle rule 


The most simple integration rule is called the Rectangle rule. Two different versions exist: 


XR xR 
‘| f(x)dx © (xg —xz)f (xz), and | Flx)dx © (xp — xz) f (xR), 
XL XL, 
which use the left and the right end point of the interval, respectively. 
Theorem 5.3.1 Let f € C![x,,xp], and m, = MAX y¢[x,,xg] Lf (x) |. Then 


1 
< =m4(xr — x1), 


[" Fendx— (ens fla0)) <5 


1 
< xm (XR = aa ge 


[Fedde — oe — x) fle) 


Proof: 
The proofs for the left and the right Rectangle rule follow the same structure. Therefore, only the 
left Rectangle rule will be considered. Using a Taylor expansion, we know that 


f(x) = f(x) + (x — x1) f'(E(x)) with E(x) € (x1, xR). 
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Integrating over [x,, xp] yields 


J fledae = f (fle) + x1) f(a) } dx = (er — a) fai) + f (x a1) FER) dx, 


Since (x — xy) > 0, the mean-value theorem for integration can be used, which means that there 
exists an 7, € [xL,Xp] such that 


XR 2 


[PF feae— an svften) = fl) [aide = fn) Fax 


XL 


Taking the absolute value on both sides and using m completes the proof. xX 


5.3.2 Midpoint rule 
The Midpoint rule uses the integration point x,y = (x, + Xr) /2, which leads to 
XR 
[° fedx® (xe — a1) F(m). 
L 
This rule is more accurate than expected at first sight. 


Theorem 5.3.2 Let f € C*[x,,xp], and mz = MAX ye [x,,xg] Lf (x) |- Then 


Proof: 
Using Taylor series, 


fle) = flea) + (x= xm) fem) + 5(a = 2a)?f" Ela), with &(x) € (xz). 
Integrating over [x , xp] yields 
LP poyax = (sam) + = saa) fra) + 300 — a)? F" Ela) 
= (xe— x1) flem) +5 ["(e—xm)?F" Ela) ae 


Since (x — xy)? > 0, the mean-value theorem for integration can be used, which means that there 
exists an 7 € [x,,xXp] such that 


[fae (en-au) flem) = 59" [aaa 


Using the definition of x,y it follows that 


[P Feo)ax— (en a) flom) = 5 f" (eR — 1). 


Taking the absolute value on both sides and using mz completes the proof. xX 
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5.3.3 Trapezoidal rule 


In the Trapezoidal rule, the linear interpolation polynomial on interpolation nodes x; and x z is 
used to approximate the integral of f € C[x,, xp] (see Section 2.2). This polynomial is given by 


f (EE 


XR — XL XR — XL 


XR—%X Pee 


Ii(x) = f (xr). 


The corresponding approximation of the integral is given by 


[ feoax re [° La (x)dx = *R>*£ (F(x,) + f(xr)). 6.1) 


This approximation is called the Trapezoidal rule, since (5.1) equals the area of the trapezium with 
vertices (x,,0), (xp,0), (xr, f(xR)) and (xz, f(xz)). 


Theorem 5.3.3 Let f € C?[xz,xp], and mz = MAX ye[x, xg} Lf” (x)|- Then 


Proof: 
From Theorem 2.2.1 it follows that linear interpolation through (x , f(x,)) and (xp, f(xr)) hasa 
truncation error 


f(x) — Lax) = 5(—x1)(e— xe) f" (E(x), with &(x) € (x1,*R). 6.2) 


Integrating both sides of equation (5.2) yields 


XR XR 
XL 


“RAL (F(x) + flee) = 5 [em )(x- ref" Cae.) 


XL XL 


Because (x — x, )(x —xr) < 0 for x € [xz,xp], the mean-value theorem for integration can be 
used, and there exists an 4 € [x,,Xp] such that 


5 | Ga) an) FE(a)) dx = SF") fe = a1) (x xe)dx = - yf" (xR - 21)*. 64) 


Using equation (5.4) in equation (5.3), taking the absolute value and using mz completes the 
proof. xl 


5.3.4 Simpson’s rule 


Using a quadratic interpolation polynomial with interpolation nodes x1, xy = (Xr — x,)/2,and 
Xp, the Simpson's rule can be derived. The integral of f is approximated by 


*R XR-X 
[O° fledx = (ra) + 4f lam) + £(eR)). 
L 
The corresponding remainder term will be presented without proof. 


Theorem 5.3.4 Let f € C*[xi, xr] and mg = MAX, €[x;,x] |f4) (x) |, then 


[ fe)dx— ECF) + 4f lem) + Fen) < segpmalee — 21), 
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5.4 Composite rules 


The integration rules that have been presented in Section 5.3 usually generate errors that are 
larger than the required accuracy allows. In this section, more accurate approximations will be 
obtained by applying a composite integration rule. 


The integral of f € C[a,b] will be approximated using a subdivision a = x9 < x41 <...< Xn =D, 
with x, =a+kh,k =0,...,n,andh = (b—a)/n. Using that 


[ seoex= [penis [Ppeoaxt + [" seyas, 


the integration rules of Section 5.3 can be applied on each subinterval [x;,_1, x,]. Here, x, = Xx_1, 
XR = Xp, and xy = (Xp_1+xp~)/2. If is an approximation of the integral es f(x)dx, for 
k =1,...,n, then the integral over [a,b] can be approximated by 


[fee = 3 Me f(x)dx = 3 = 1. 
e k=17*k-1 k=1 


In the following theorem, an upper bound for the remainder term of a composite rule will be 
considered. 


Theorem 5.4.1 Let f be given on [a,b], and leta = x9 < x1 <... < X%, = b, with x, = a+kh, 
and h = (b—a)/n. Let I, be an approximation of the integral La f(x)dx,k = 1,...,n. Assume 


that the absolute value of the remainder term of I, has upper bound c,-hP*1, with p € IN. Define 
c =max{c1,...,Cn}, then 


[fon e i < c(b—a)h?. 


Proof: 
Using the triangle inequality, 


[feo i = y (f° Foyax— i) 


Using the remainder term for each I; yields 


<i|f* pode —1 
sy fo fede h 


b n n 
i flsyax—1| SG he NS ehh Soe nee, 
. k=1 k=1 


Finally, we use that n - h = b — a to arrive at the given upper bound. Xx 


In the remainder of this section, the integration rules of Section 5.3 will be extended to their 


composite rules in order to approximate [, i f(x)dx. 


Rectangle rule 
The left and right composite Rectangle rules are given by 


n—1 
=k DS il) = hE @) + f(at+h)+...+f(b—h)), 


Ie = Ye flaw) =A (Fle +h) + f(a+2h) +... + f(b)). 
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From Theorem 5.3.1 it follows that for each interval [x,_1, xx], Ck = MAXye[x,_5,x,] |f (x) |- Defining 
My = maxyejqp)|f'(x)| = max{ci,...,Cn}, the corresponding upper bounds for the remainder 
terms are (use Theorem 5.4.1) 


b 


[ fedax— 1 < 5Mi(b—a)h, 
i: 
| flax — Ip = 5Mi(b—a)h. 


Midpoint rule 
The composite Midpoint rule is given by 


Im =hS, f (SF) =n(Fo 5h) + flat Sh) +... + f(b 5h). 


Defining Mz = max,¢{a,p)|f" (x) |, the upper bound for the remainder term (use Theorems 5.3.2 


and 5.4.1) is 
b 


J flax = Im 


a 


1 2 
< _ , 
M3(b ayh 


Trapezoidal rule 
For the Trapezoidal rule, the composite version is 


n 


5 Da lF Caen) +F0)) =H (FAO) + lath) +--+ FO H+ 5 FC), 


with remainder term (use Theorems 5.3.3 and 5.4.1) 


x)dx — It] < Mlb —a)k?, 


where, again, Mz = maxy¢ jap) |f" (x) |- 


Simpson’s rule 
The repeated Simpson’s rule is given by 


Is = . ¥ (For) +4f (A) +f) 


k=1 


1 2 3 
=n (2 F(a) t flat sh) + sfla+h) + sf lat sh) +...+5f—5h)4 ZF) 


with remainder term 


with Mg = maxy¢ {qj |f4) (x)]. 
Remarks 


e Asa consequence of the error estimates, the Rectangle rules are exact if f is a constant, the 
Midpoint and Trapezoidal rule are exact for linear polynomials, and Simpson’s rule is exact 
for polynomials of degree 3. 
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e The remainder terms of the composite Midpoint and Trapezoidal rule are of order O(h?), 
whereas the Rectangle rules have remainder terms of order O(h). The composite Simpson’s 
rule has a remainder term of order O(h*). The Rectangle, Midpoint and Trapezoidal rules 
all require the same amount of work. Among these, the Midpoint and Trapezoidal rules are 
clearly preferred. The composite Simpson’s rule needs twice as many function evaluations, 
but converges faster. 


Note that the upper bound for the error of the Trapezoidal rule is twice as large as the 
bound for the Midpoint rule. 


e The composite Trapezoidal rule may also be interpreted as the integral of the linear spline 
approximation of f. Suppose s is the linear spline approximation of f ina = xo,...,%n = D. 


Then the composite Trapezoidal rule computes Ir = [ . s(x)dx. 
Example 5.4.1 (Spoiler) 


For the example mentioned in the introduction, the length of the flat plate should be approxi- 


mated with an error of at most 1 cm. Defining f(t) = \/1-+ (cost), this length equals tee f(t)dt. 
Applying the composite left Rectangle rule, the upper bound for the remainder term equals 


[soma 


where My = max;<(0,0.8] |f’(t)|. The derivative of the integrand is given by 


1 
< 5M - 0.8h, 


POs —costsint _ —$ sin 2t 
J/1+ (cost)? /1 + (cost)? 


From this it follows that i 
IMO < 5 = Mi. 


Requiring that the error is smaller than 1 cm, f should be chosen such that 
Mi -0.8h < ; -0.8h < 0.01. 


Step size h = 0.05 meets this requirement. 


The exact value of the integral equals 1.0759354360688...m. For h = 0.05, the composite left 
Rectangle rule computes 1.0807 m, hence the error is, in fact, less than 1 cm. 


In Table 5.1 the errors are tabulated for different values of the step size h. Because the Rectangle 
rule is O(h), we expect that by halving h, the error of the Rectangle rule decreases by a factor 
2. The results are in agreement with these expectations. Similarly, the Midpoint rule and Trape- 
zoidal rule errors decrease by a factor 4, as expected because the methods are O(h?). Finally, the 
Simpson’s rule errors decrease by a factor 16, because the method is O(h*). 


Table 5.1: The error tie f(t)dt — i for f(t) = \/1+ (cost), using different composite integra- 


tion rules I. 


Left Rectangle rule | Midpoint rule | Trapezoidal rule 
O08 | 55435-10- 1.1698-10-2 | 2.2742-10-2 | 2.1789-10~ 
3.356710 2.7815-10-> | _5.5221-10~ T.3618-10- 


1.8174 - 107 6.8643 - 10-4 1.3703 - 107 8.4852: 10~ 
9.4302 - 107 1.7105 - 10-4 3.4194 - 10 5.2978 -10-® 
4.8006 - 10— 4.2728 - 10~ 8.5445 - 10~ 3.3102 - 10~ 
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5.5 Measurement and rounding errors 


Measurement and rounding errors can play an important role in numerical quadrature rules. 
In this section, this error behavior will be investigated. Therefore, we assume that the function 
values are perturbed by an error ¢ > 0, such that 


f(x) — fo) = efx). 
Defining émax = MaX;¢[q,] €(x) it follows that 


b 


b 
| (f0)~ Fe) ax} < fF) - fe) lax < emax(b~ a). 


a 


Next, the integral is approximated with an integration rule using the perturbed values. In this 
derivation, the composite left Rectangle rule will be used. The difference between the exact inte- 


gral, Z = f[ Ws f(x)dx, and the perturbed approximation, f,, can be bounded by 
|Z — [| = [fii t,| < IZ —I,|4+ |, — fL| 


n—1 n—1 
=|Z-h|+]h YO f(x —h YD f(x) 
k=0 k=0 


n—1 n—-1 
<|Z-h) +h Ye f(a) — f(a) = |Z - + Yo ela). 
k=0 k=0 


Assuming €(x) < €max, the total error for the composite left Rectangle rule is 


b 
n 1 n—-1 
J feoax — fy) < 5Mi(b—a)h +h emax 
k=0 


a 


1 1 
= 5Mi(b —a)h+h-nemax = (Smt + Em (b—a), 


where My = maxye{qp)|f/(x)|. Note that it is useless to take h any smaller than 2emax/M1, 
because then the measurement error dominates the total error. 


Note that this derivation is specific for the composite left Rectangle rule. Similar error estimates 
can be derived for the Midpoint, Trapezoidal and Simpson’s rule. 


Example 5.5.1 (Spoiler) 


The computation of the length of the flat plate can be perturbed assuming a small manufacturing 
error. Consequently, the integrand contains a rounding error ¢(x) = 107. In Figure 5.2, the effect 
of rounding errors can be seen for different values of h (using the composite left Rectangle rule). 
It turns out that the total error remains larger than 0.8 x 1073. It serves no purpose to take the 
step size smaller than 4€max = 4: 1077; 


Conditioning of the integral 

The numerical approximation of an integral Z is either a well-conditioned or an ill-conditioned 
problem. If €max is an upper bound for the relative error (different from €max for the absolute 
error), then the absolute error in the function values is bounded by the inequality 


Lf(x) — f(x)| < F(x) lemax- 
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Figure 5.2: The errors | ee t)dt—I | (without rounding errors) and IIe ae (t)dt — [| (with 
rounding errors) for f(f) = \/1+ (cost)2, using e(x) = 107°, and the composite left Rectangle 


rule. 


In that case, an upper bound for the relative error in the integral is given by 


b b b 
J iae "2 Ieee f F(x) [dx 
<2 * Emax: 
b ee: 
ETae piel 
The value ; 
J F(x) |dx 
Kr = rc 
ei 


is called the condition number of the integral Z. If Kz - €max = O(1), then the determination of TZ is 
ill conditioned: the relative error in the computed integral is of order O(1). 


Example 5.5.2 (Profits) 


The profits or losses per day of a car manufacturer depend on the season. The following profits 
formula (in billions of $) are assumed: 


Wepring(t) = 0.01 + sin (at - =) , t € [0,1], and wey (t) = 0.01 + sin (xt + =) ,t€ [0,1]. 
The total profits in spring (Wspring) and fall (Wefan) equal 


1 
Wspring = = fw spring (t t)dt = 0.01, Wea = = oun t)dt = = 0.01, 
0 


which equal 10 million $. 
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The composite left Rectangle rule using = 0.02 approximates Wepring with —0.01 and Wray with 
0.03. The composite Midpoint and Trapezoidal rule obtain the exact results for h = 0.02. 


Note that 


1 
Wspring = |Wepring | = 0.01 and i | spring (4) |at ~ 0.64. 
0 


Therefore, the condition number of Wepring equals Kz = 64. If the function values only have 
an accuracy of two digits, then Emax = 0.005. This means that Kz - €max = 0.32, such that the 
determination of the integral is an ill-conditioned problem. In the worst case, no digit of the 
approximation is correct. 


5.6 Interpolatory quadrature rules * 


Subsequently, quadrature rules based on higher-order interpolation will be discussed. Newton- 
Cotes quadrature rules will be considered as a particular case. In this section, integration over a 
single interval {x,, xp] will be considered. 


General quadrature rules 
Let x9,...,XN © [XL,xXR]. Let Ly be the Lagrange interpolation polynomial of f at the nodes 
X0,-.-,Xn,as defined in Section 2.3, equations (2.7) and (2.8). Then 


[Vseoaxe [ Ly (x jax = |" ¥ feodbant je = Ff 


using the weights 
XR 
wy = i Len (x)dx, (5.5) 
XL 
which do not depend on f. 


Theorem 5.6.1 Let xo,...,xn € [xz,Xp], and let f be a polynomial of degree at most N. Then every 
N + 1-point quadrature rule based on interpolation is exact for f: 


XR N 
[ Fedde = wif (x0. 


e=0 


Proof: 
From Theorem 2.3.1 it follows that f coincides with its Lagrange interpolation polynomial on 


[XL, XR], and hence, 
XR XR 
| flax = [En (eax 
XL XL 


Conversely: 


Theorem 5.6.2 If xo,...,Xy and wWo,...,Wy are given and if 


XR N 
J pdx = Y wep (xe) 
ie. £=0 


holds for all polynomials p of degree at most N, then ae wy f(x¢) is the quadrature rule based on inter- 
polation for f(x)dx 
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Proof: 
Let Ly be the interpolation polynomial of f on the nodes xo,...,xn. Then, by definition, 


XR N N 
J exo = d weLn (xe) = Do wef (xe), 
XL (=0 


l=0 


and equation (5.5) uniquely determines the weights wo,...,Wy. 
Note that the verification for Ly is already enough to prove the theorem. xX 


The following theorem can be used to compute remainder terms for interpolatory quadrature 
rules. The proof of this theorem will be omitted, and can be found in [9]. 


x . . . . . 
Theorem 5.6.3 Let [ a f (x)dx be approximated by the integration of the Lagrange interpolation poly- 
nomial Ly. Then there exists a € € [x.,xXp] such that the remainder term of the approximation satisfies: 


if N is even and f € CN**[xz, xp]: 


a f(x)dx — = Ly(x)dx} = 


N 
1 N 
x aaa SNE 


Remarks 


e Asa result of the theorem, interpolatory quadrature rules integrate polynomials up to de- 
gree N + 1 (if N is even) or N (if N is odd) exactly. 


e Composite interpolatory quadrature rules are found by repeating the interpolation on sub- 
intervals, as has been explained in Section 5.4. 


Definition 5.6.1 (Newton-Cotes quadrature rules) If x, = x9 < x1 < ... < XN = Xp and the 
nodes are equidistantly distributed, then the sum is wef (x¢) is called the Newton-Cotes quadrature 


rule for fs f(x)dx. 


The Trapezoidal rule (Section 5.3.3) uses linear interpolation in the nodes x, and xr. This means 
that the Trapezoidal rule is a Newton-Cotes quadrature rule with N = 1. Similarly, the Simpson’s 
rule (Section 5.3.4) uses quadratic interpolation in the nodes x ,, (x, + Xpg)/2 and xp: this is a 
Newton-Cotes quadrature rule using N = 2. Indeed, the error terms of the Trapezoidal and the 
Simpson’s rule correspond to Theorem 5.6.3. 


Newton-Cotes quadrature rules of higher order are rarely used: negative weights will occur and 
that is less desirable. 
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5.7 Gauss quadrature rules * 


In the previous section, quadrature rules have been derived by choosing the nodes x9,...,xn 
beforehand and subsequently determining the weights wo,..., wy. These weights are computed 
by integrating the polynomial that interpolates f on the nodes x;, £ = 0,...,N. These rules 
integrate polynomials up to degree N (or N + 1) exactly. 


Alternatively, the weights and the nodes can be determined together, in such a way that polyno- 
mials of a degree as high as possible will be integrated exactly. The 2N + 2 unknown values of we 
and x, can be used to generate a system of 2N + 2 nonlinear equations such that the polynomi- 
als 1,x,...,x2N+1 are integrated exactly. It can be shown that this system has a unique solution. 
The resulting quadrature formulae are called Gauss quadrature rules. The following illustration 
explains the computation of a two-point Gauss quadrature rule. Subsequently, the general ap- 
proach will be explained. In the derivation, the interval [—1, 1] is used. The procedure for general 
intervals [x,,, Xp] will be treated later on. 


The two-point Gauss quadrature rule requires weights wo, w1, and points xg, x; such that the 
quadrature formula 


ye f(x)dx © wof (xo) + wif (x1) 


computes the exact result when f is a polynomial of degree at most 3. Using f equal to 1, x, x? 
and x? consecutively, this means that wo, w1, xo and x; must satisfy 


f(x)=1 > wo +, =f ldx =2, 
f(x)=x => wox. +wim= f xdx =0, 


-1 


1 
<Q) =x" SS wore Awe = [rae Ss 
“1 


1 
xX) Sa os wx?) +0133 = x3dx = 0. 
0 1 


-1 


By symmetry of the equations, wo = w,, and therefore wo = w, = 1. Then it follows that 


For Gauss quadrature formulae, the integration limits are always taken equal to x, = —1 and 
XR = 1. Consequently, Gauss quadrature rules have the following general form 


1 N 
[flax Y wef) = To, 
1 (=0 
in which the 2N + 2 parameters x, and wy, £ = 0,...,N are such that 


We . N ‘ 
[tds = Lower, 7 =0,---,2N41 
=s (=0 
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Table 5.2: Weights and integration points for Gauss quadrature rules. 


holds. Tables with Gauss integration points and corresponding weights can be found in many 
text books on numerical analysis. In Table 5.2, the values are given for the first four Gauss quadra- 
ture rules. Note that the Gauss quadrature rule for N = 0 corresponds to the Midpoint rule. 


In order to integrate a function numerically over an arbitrary interval [x,,xp] using a Gauss 
quadrature rule, the interval of integration has to be transformed to |[—1, 1]. This change of inter- 
val is achieved by linear transformation using 


x = 1 = 
| " f(y)dy = 28 : “tf f (# sox t as) dx. (5.6) 
XL —1 


Replacing the latter integral by a Gauss formula yields 


XR xe —xp & xR—X xp+x 
XL l=0 


The integration points x¢ correspond to the Gauss integration points as tabulated in Table 5.2. 


If f € C?7N+2[ x7, xp], then the N + 1-point Gauss quadrature rule has the remainder term 


XR re Xe XR—-X xptxr\ — (xr—x1)?Nt((N+1)!)4 
[ flydy— BS ** Yet ( Rem t AS 8) = ON Le), 


with ¢ € (xz,Xpr). This formula shows that Gauss integration rules are indeed exact for polyno- 
mials up to degree 2N + 1. 


Remark 
Composite Gauss quadrature rules are found by repeating the Gauss quadrature rule on subin- 
tervals, as has been explained in Section 5.4. 


Example 5.7.1 (Polynomial) 


In this example, the value of ie y’dy will be approximated. The exact value of the integral 
equals 3.07861328125. Using a change of interval as in equations (5.6) and (5.7), the integral is 
approximated by 


15 N 
| F(y)dy © 0.25 Y7 wef (0.25xp +1.25), 
! (=0 


where f(y) = y’, and x, are the Gauss integration points. 
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In Table 5.3, the errors are shown for different values of N. As expected, the 4-point Gauss 
quadrature rule (N = 3) gives the exact result. Note that also for smaller numbers of integration 
points, the Gauss quadrature rule is very accurate. 


Table 5.3: The error iG y’dy —Ic 


5.8 


1 , using different N + 1-point Gauss integration rules Ic. 


|? “dy = Ic| 


6.9443 - 107 


1 1.1981 - 107 
2 2.4414 - 107 


Summary 


In this chapter, the following subjects have been discussed: 


5.9 


Riemann sums; 

Simple integration rules (left and right Rectangle rule, Midpoint rule, Trapezoidal rule, 
Simpson’s rule); 

Composite rules; 

Measurement and rounding errors; 

Interpolatory quadrature rules; 

Newton-Cotes quadrature rules; 

Gauss quadrature rules. 


Exercises 


. In this exercise, the integral 


1 
/ ((10x)3 + 0.001)dx 
“4 


should be computed. 
(a) The relative rounding error in the function values is less than e. Determine the relative 
error in the integral due to the rounding errors. 


(b) Use the composite Midpoint rule and ¢ = 4 x 107°. Give a reasonable value for the 
step size h. 


. Determine ie 5 x4dx with the Trapezoidal rule. Estimate the error and compare it with the 


real error. Also compute the integral with the composite Trapezoidal rule using step size 
= 0.25. Estimate the error using Richardson’s extrapolation. 


. Calculate f, - x’ dx with the composite Trapezoidal rule using two equal intervals and with 


the Simpson’s rule. Compare your results with the results for Gauss quadrature rules as 
tabulated in Table 5.3. 


Chapter 6 


Numerical time integration of 
initial-value problems 


6.1 Introduction 


An initial-value problem usually is a mathematical description (differential equation) of a time- 
dependent problem. The conditions, necessary to determine a unique solution, are given at the 
starting time, tg, and are called initial conditions. 


As an example, we consider a water discharge from a storage reservoir through a pipe (Fig- 
ure 6.1). The water is at rest until, at {9 = 0, the latch is instantaneously opened. The inertia of 
the water causes a gradual development of the flow. If this flowing water is used to generate elec- 
tricity, then it is important to know how long it will take before the turbines work at full power. 


Figure 6.1: The storage reservoir, the drain pipe and the latch in closed position. 


This process is described by the nonlinear initial-value problem 


| 41 — p(t) — ag?, t>0, 
q(0) = 0. 


(6.1) 


In this problem, q(m?/s) is the mass flow rate, p is the driving force, which is equal to 


force viene and measured in Bk = m/s" 
density kg/m 


y 
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and aq? is the friction. The driving force depends ~among other things— on the water level in the 
reservoir. Assuming that p(t) is constant (p(t) = po), a solution in closed form is known: 


alt) = [22 tanh yA), 


For general functions p, an analytic solution cannot be obtained. In that case, a numerical method 
must be used to approximate the solution. 


6.2 Theory of initial-value problems 


In this section, a short summary of the theory of initial-value problems will be presented. Before 
applying a numerical method for solving an initial-value problem, it is important to make sure 
that: 


e The initial-value problem has a solution; 
e The solution to the initial-value problem is unique; 


e The solution changes continuously when the parameters in the differential equation or the 
initial conditions are slightly perturbed. 


A theorem about these issues uses the concepts of a well-posed initial-value problem (see also 
definition 1.6.1) and Lipschitz continuity. 


Definition 6.2.1 (Well posedness) The initial-value problem 
(6.2) 


is called well posed if the problem has a unique solution, that depends continuously on the data (a, Yq and 


f). 


Definition 6.2.2 (Lipschitz continuity) A function f(t,y) is called Lipschitz continuous in y on a 
set D C RR’, if there exists a constant L > 0 such that 


If (t-y1) — f(t y2)| < Lys — y2|, for all (t,y1), (ty2) € D. 


The constant L is called the Lipschitz constant. 


The following theorem yields a sufficient condition for a function to be Lipschitz continuous. The 
proof uses the intermediate-value theorem. 


Theorem 6.2.1 [f the function f is differentiable with respect to the variable y, then Lipschitz continuity 
is implied if 


uw) < L, forall (t,y) € D, 
for some L > 0. 


The following theorem presents a sufficient condition for an initial-value problem to be well 
posed. The proof can be found in many text books on differential equations, for example in 
Boyce and DiPrima, [4]. 


Theorem 6.2.2 Suppose that D = {(t,y)|a < t < b,-—00 < y < oo} and that f(t,y) is continuous on 
D. If f is Lipschitz continuous in the variable y on D, then the initial-value problem (6.2) is well posed. 
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6.3 Elementary single-step methods 


In contrast to the example in Section 6.1, for many initial-value problems it is impossible to deter- 
mine a solution explicitly. Although there are generally applicable techniques to find qualitative 
properties of the solution, these properties usually provide insufficient quantitative information. 
For that purpose, numerical methods are designed to approximate the solution. Most numerical 
methods are designed for first-order initial-value problems. Numerical time-integration methods 
that can be applied directly to higher-order initial-value problems do exist but are not treated in 
this book. 


In this section, several methods will be presented that can be used to approximate the solution to 
the initial-value problem based on the scalar first-order differential equation 


{ y=f(ty), t> to, 
y(to) = Yo: 


Therefore, it is illuminating to consider the slope field generated by the differential equation. A 
differential equation is a rule that provides the slope of the tangent of a solution curve at each 
point in the (t, y)-plane. A specific solution curve is selected by specifying a point through which 
that solution curve passes. A formal expression for the solution curve through a point (f, Yo) is 
generated by integrating the differential equation over t: 


y(t) = y(to) + [ seamen) 


Because the unknown function y appears under the integral sign, this equation is called an in- 
tegral equation. Although the integral equation is as difficult to solve as the original initial-value 
problem, it provides a better starting point for approximating the solution. 


In order to approximate the solution to the integral equation numerically, the time axis is divided 
into a set of discrete points ty = to + nAt,n = 0,1,.... The solution at each time step is computed 
using the integral equation. Assume that the solution to the differential equation at time t, is 
known, and denoted by yn = y(t). Then, the solution at t,,,1 is equal to 


tn41 


Yns1 = Ynt f f(t-y(t) at. (6.3) 


The numerical time-integration methods that are presented in this section can be characterized 
by the approximation of the integral term [, va f(t, y(t) )dt: different numerical quadrature rules 


(Chapter 5) can be used. Because integral equation (6.3) only considers one time interval [ty, t+1], 
these numerical methods are called single-step methods. Multi-step methods, which may also use 
to,..-,t,—1, will be described in Section 6.11. 


The numerical approximation of the solution at ty is denoted by wy, n = 0,1,..., where wo = yo. 


Forward Euler method 

First, the Forward Euler method will be presented, which is the most simple, earliest, and best- 
known time-integration method. In this method, an approximation of the integral in equation 
(6.3) is obtained by using the left Rectangle rule (Figure 6.2(a)): 


Yn41 © Yn + (tn4a — tn) f (tn Yn)- (6.4) 


Note that f(tn, Yn) is the slope of the tangent in (fn, Yn). The numerical approximation of the 
solution at t,41, denoted by w,41, is computed by 


Wht. = Wnt Atf(tn, Wn), (6.5) 
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5.5 T 5.5 


Ff (tna. Yn41) F (tna, Yn41) 


0.5 0.5 
tn tn41 th tn41 


(a) Left Rectangle rule (b) Trapezoidal method 


Figure 6.2: Two different approximations of the integral /, Nee f(t, y(t) dt. 


5.5 


Exact 


= — — Forward Euler method Y5 
I I 


355 0.2 0.4 t 0.6 0.8 1 


Figure 6.3: Approximation of solution using the Forward Euler method. 


in which f(tn,Wn) is the slope of the tangent in (f;,wn). This procedure leads to a sequence 
{tn,Wn},n = 0,1,... of approximation points. Geometrically this represents a piecewise-linear 
approximation of the solution curve, the Euler polygon, see Figure 6.3. 


Because w,,41 can be computed directly from equation (6.5), the Forward Euler method is called 
an explicit method. 


Backward Euler method 
The Backward Euler method is obtained by using the right Rectangle rule to approximate the inte- 
grals. At time t,,+1, the solution is approximated by 


fn41 
Yvr = Int [F(t y(E))dt = Yn + Afra Yona): 
The numerical approximation at t = t,41 equals 


Wnt = Wn + Atf (tn41, Wn41): (6.6) 


Note that the unknown w,+, also occurs in the right-hand side. For that reason, the Backward 
Euler method is called an implicit method. If f depends linearly on y, then the solution to (6.6) can 
be easily computed. 
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For nonlinear initial-value problems, the solution to (6.6) is non-trivial. In that case, a numerical 
nonlinear solver has to be applied (Chapter 4). This means that the Backward Euler method uses 
more computation time per time step than the Forward Euler method. However, the Backward 
Euler method also has some advantages, as will be seen later. 


Trapezoidal method 
The Trapezoidal method is constructed using the Trapezoidal rule of Chapter 5 to approximate the 
integral (see Figure 6.2(b)). This means that 


tn41 At 
Yn = Yn + [ F(t,y(t))dt © yn + > (F(tn Yn) + ftn+1,Yn+t)) + (6.7) 


The approximation of the solution at ¢ = t,+1 equals 


Wnqi = Wn (f (ta, Wn) + f (tn41,Wn41)) - (6.8) 


Note that the Trapezoidal method is also an implicit method. 


Modified Euler method 
To derive an explicit variant of (6.8), the unknown w,+1 in the right-hand side of (6.8) is predicted 
using the Forward Euler method. This results in the following predictor-corrector formula: 


predictor: @y41 = Wn+Atf(tn, Wn), (6.9a) 
At = 
corrector: Wy41 = Wy + a (f (tn, Wn) + f (tn41,Wn41)) - (6.9b) 


From wn (calculated in the previous step), the predictor @,,4, can be computed. Using this value 
in the corrector, this results in an explicit formula: the Modified Euler method. 


6.4 Analysis of numerical time-integration methods 


The global truncation error e,,1 at time t,,41 is defined as the difference between the solution and 
the numerical approximation: 


Cn+1 = Yn — Wn+1- (6.10) 


In order to analyze the global truncation error, we need to introduce two important concepts: 
stability and consistency. 


6.4.1 Stability 


A physical phenomenon is called stable if a small perturbation of the parameters (including the 
initial condition) results in a small difference in the solution. Hence this concept is connected to 
the continuous dependence of the solution on the data. Some examples of unstable phenomena 
are buckling of an axially loaded bar or resonance of a bridge’. Throughout this chapter, we 
assume that the initial-value problem is stable. First, we will consider the conditions for an 
initial-value problem to be stable. 


Analytical stability of an initial-value problem 
Let an initial-value problem be given by 


{ y= f(ty), t> to, 
y( 


to) = yo. 


lnttp://ta.twi.tudelft .nl/nw/users/vuik/information/tacoma_eng.htm1 
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Suppose next that the initial condition contains an error ¢9. Then, the perturbed solution # satis- 
fies 
{ F=f(ty), t> to 
F(to) = yo + £0- 


The difference between the two functions is given by e(t) = #(t) — y(t). An initial-value problem 
is called stable if 
|e(t)| < co forallt > to, 


and absolutely stable if it is stable and 
lim |e(t)| = 0. 
too 
If |e(t)| is unbounded, then the initial-value problem is unstable. 


It should be noticed that stability of the initial-value problem does not imply that the numerical 
approximation is stable. The stability of the approximation depends on the numerical method 
that is used. In the rest of this section, a numerical stability analysis is performed, both for linear 
and nonlinear initial-value problems. 


The test equation 
We first analyze how a perturbation in the initial condition propagates for 


y= Ay + g(t), t> to, 
ee a 
The perturbed problem is given by 
j= Ag+ g(t), t> to, 
{ sto)=90 4 ox 


Subtracting the equations in (6.11) from those in (6.12), we obtain an initial-value problem for the 
error € in the approximation: 


f= -y' =AG-y)=Ae, t> to, 
6.13 
{ e(to) = £0. Age 


This initial-value problem plays an important role in stability analysis, and is often called the 
initial-value problem for the so-called test equation. 


The solution to (6.13) is 
e(t) = ege* (tt), 


which shows that (6.13) is stable if A < 0 and absolutely stable if A < 0. 


Numerical stability of initial-value problems based on the test equation 
Next, we explain how perturbations propagate if initial-value problem (6.11) is approximated by 
a numerical method. For example, the Forward Euler method applied to (6.11) yields 


Wy. = Wy + At(Awn + g(tn)). 
For the perturbed initial-value problem, we obtain 
Wy. = Wy + At(AWn + g(tn)). 


Subtracting these equations yields the following propagation formula for the numerical approx- 
imation of the error (€, = Wy — Wn): 


En4y = (L4AAP)En. 


This means that the error propagation for the Forward Euler method is modeled by the test 
equation. 
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For each of the numerical methods from Section 6.3, the application to the test equation can be 
written as 
En41 = Q(AAL)En. (6.14) 


The value Q(AAt) determines the factor by which the existing perturbation is amplified, and is 
therefore called the amplification factor. By substitution it may be verified that 


Forward Euler method: Q(AAt) = 1+ AAt, (6.15a) 
Backward Euler method: Q(AAt) = —— (6.15b) 
; 1+ 5AAt 
Trapezoidal method: Q(AAt) = —+=—_, (6.15c) 
— LAAt 
1 
Modified Euler method: Q(AAt) = 1+ AAt+ 5 (Aat). (6.15d) 


The amplification factor plays an important role in the analysis of the various numerical methods. 


Using equation (6.14) recursively it follows that @, = (Q(AAt))" @. This expression shows how 
the initial perturbation is amplified over time. For stability, we require that |@,| < oo for all 
n € N. This means that the numerical scheme is stable if and only if 


|Q(AAF)| < 1. (6.16) 


Because A is known, we can compute the values of At such that inequality (6.16) is still satisfied. 
For absolute stability, we require |Q(AAt)| < 1. Note that the method is unstable if |Q(AAf)| > 1. 
This leads to the following characterization of stability: 


absolute stability <= > |Q(AAt)| <1, (6.17a) 
stability <— > |Q(AAt)| <1, (6.17b) 
no stability << = |Q(AAt)| > 1. (6.17c) 


As an example, we investigate the Forward Euler method, where Q(AAt) = 1+ AAt. Here, 
we assume that A < 0, hence the initial-value problem is absolutely stable. To satisfy (6.16), At 
should be chosen such that 

—1<1+4+AAt <1, 


or 
—2<AAt <0. 

Since At > 0 and A < 0, the right-hand side is automatically satisfied. From the left-hand side it 

follows that 


a2 
At< —5, (6.18) 


with strict inequality for absolute stability. 

For the Backward Euler method, stability is found when 
ee Se ot 

iS 1—AAt — ! 


It is easily seen that this condition is satisfied for each At > 0, since A < 0. Therefore, the 
Backward Euler method is unconditionally stable if A < 0. 


A similar analysis shows that, for A < 0, the Modified Euler method is stable if 


2 
At < -—= 
pallies dN’ 
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with strict inequality for absolute stability. The Trapezoidal method is again unconditionally 
stable. 


Stability of a general initial-value problem 
For a general (nonlinear) initial-value problem, the stability of the corresponding linearized prob- 
lem will be analyzed. The initial-value problem 


y'=f(ty), t> to, 
{ y(to) = Yo j oe) 


is approximated by a linear Taylor expansion (Theorem 1.6.8) about the point (, 7): 


fey = fe +y-nLZE9 + -DLE9. 


In the neighborhood of (f, 7), initial-value problem (6.19) can be approximated by 


fi a a» Of on » nOfor . 
y= f6)+U-NLED + O-DZEM) =A +300, (6.20) 
with 
Of. 
A= xb 9), (6.21a) 
A Of.» a Of ,» 
s(t) = FE9) 95469) + 0-D LED). (6.21) 


The value of A in equation (6.21a) can be used in the requirement for stability (|Q(AAt)| < 1). 
Note that the stability condition depends on f and 9. 


Example 6.4.1 (Nonlinear problem) 


A simple model of the water-discharge problem is given by 


{ yf S 10i-:20; “tS 0, 
y(0)= 0. 


This model can be related to initial-value problem (6.1), using a = 10 and po = 20, and the 
solution is given by 


y(t) = V2 tanh(V3008). 
This means that y(t) € (0, V2) for all t > 0. 


In this example, f(t, y) = —10y? + 20. The linearized approximation of the differential equation 
uses 


Because y(t) > 0 for all t > 0, this initial-value problem is absolutely stable: A < 0. 


Furthermore, it is important to investigate numerical stability. Note that if the Forward Euler 
method is stable, then the numerical approximation of the solution satisfies wn € (0, V2) for all 
n € N. The maximum v2 can be used as an upper bound to derive a criterion for the largest 
admissible time step that ensures stability: 


At<—- 
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6.4.2. Local truncation error 


After having considered the stability of numerical methods for initial-value problems, the accu- 
racy of these methods will be analyzed. In this subsection, the new error in each time step (the 
local truncation error) will be considered. The cumulative effect of these errors in the final result 
(the global truncation error), will be considered in Section 6.4.3. 


Definition 6.4.1 (Local truncation error) The local truncation error at time step n + 1, T 41, is de- 
fined as 


— Yn41 7 2n41 
Th41 = “AR” 


in which Yy+4 is the solution at time step n+ 1 and z,4, is the numerical approximation obtained by 
applying the numerical method with starting point yn, the solution at time step n. 

The estimation of the local truncation error will be explained for the Forward Euler method and 
the Modified Euler method. We assume that all derivatives that we use are continuous. 


Forward Euler method 
For the Forward Euler method, we have 


Tr41 = Yn+1 — (Yn + Attn Yn)) . (6.22) 


Taylor expansion of y,,;; about ty, yields 


Yns = Y(tnsa) = yltn) + (tna — Endy! (tn) + Sy 
/ AP /) 
= Yn + Atyy, + =-9"6), (6.23) 
fora é € (tn, ty41)- 
Note that y satisfies the differential equation, such that 


y' = f(ty). (6.24) 
Using equations (6.23) and (6.24) in (6.22) yields 


yn + Ath, + SE y!(2) — yn + Atyh) — At, 
Tal = AE = a (¢), 


where € € (fn, tn41). This means that the Forward Euler method is of order O(At). 


Modified Euler method 
For the Modified Euler method, the value of z,,41 is obtained by replacing the numerical approx- 
imation wy, in equation (6.9) by the exact value yj: 


Znt41 = Yat Atf (ta, Yn), (6.25a) 
Znt1 = Yar (Flt Yn) + f(tn41Zn41)): (6.25b) 


Using a two-dimensional Taylor expansion about (tn, Yn) (Theorem 1.6.8), the final term of (6.25b) 
can be written as 


ftnt1,Zn41) = f (tn + At, yn + Atf (tn, yn)) 
of 


= f(b am) + (tnt dt to) (SE) + on + att bam) — ym) (FE) 


1 pf OE\. 4 2( Hf 
+ 5 (tn + AE tn) (SE) + 5+ att Onan) — yn) Oy? \ 


2 
+ (fn + At — tn) (Yn + Atf (tn, Yn) — Yn) (54) +.... 
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Writing fn = f (tn, Yn), this means that 
ftnt1Zn41) = fn t+ At (Z) + Atfn (F) 
+ sae (Sf) + +5 (Atha)? (Ft) +0rh (Se) a. (6.26) 


Note that the last three specified terms of (6.26) are O(At*) (and that the dots represent terms of 
order O(At?) and higher). Since higher-order terms are not required, we arrive at 


) ) 
faq 2n41) = fn + At (Zz +f35) + O(Af’). 
YS n 
Substitution of this expression into (6.25b) yields 


Zn = Yn + Atfa + (3 Say so) a OAL): (6.27) 


Differentiating the differential equation with respect to t and using the chain rule, yields 


ny — 4y _ af(ty(t)) _ of dt ofdy _ (of | ,of 
Oe ae ge Ode oR 2) 
Using relations (6.24) and (6.28), equation (6.27) transforms into 
At 2 
Zn41 = Yn + Aty;, + Yn + O(At?). (6.29) 


Note that the first three terms of this expression are exactly the first three terms of the Taylor 
expansion 


At? 
Yn = Yn + Aty, + Yn + (AE). (6.30) 


This means that 


Tn _ festa 3 O(At’). 


Local truncation error for initial-value problems based on the test equation 

For implicit time-integration methods, the procedure to estimate the local truncation error has to 
be adapted. Therefore, we only consider the error estimation for the test equation y’ = Ay. The 
treatment for implicit time-integration methods to the general equation y’ = f(t, y(t)) is outside 
the scope of this book. 


Remember that Z,,+1 is obtained by applying the numerical scheme to the solution y,. Using the 
amplification factor, this can be computed easily: Z,+1 = Q(AAt)yn (Section 6.4.1). Therefore, the 
local truncation error equals 


— O(AAt 
Tr41 = Yn+1 a yn (6.31) 


Furthermore, because the solution to the test equation equals y(t) = yoe™, it is clear that 
Vina = yor tA) = y,er4!, (6.32) 


Substitution of (6.32) into (6.31) yields 


Ta+1 = > one — ont (6.33) 
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This means that the size of the local truncation error is determined by the difference between 
the amplification factor and the exponential function. To determine this difference, the Taylor 
expansion of e*4! can be used (see Theorem 1.6.10): 


ehOt 1 + AAE+ 5(AAt) + SOA Ses One (6.34) 


3! 


Comparing (6.34) with Q(AAt) = 1+ AAt we conclude that the Forward Euler method has a local 
truncation error of order O(At). For the Modified Euler method, 


Q(AAE) = 1+AAE+ s(n) 


such that the local truncation error is O(At?) and this method is more accurate in the limit At > 0. 


For the Backward Euler method, a Taylor expansion of the amplification factor yields (see Theo- 
rem 1.6.9) 


= 1 _ 2 — = k 
Q(AAt) = Tovar AA (AAD Poe = ny : 


hence the local truncation error of the Backward Euler method is O(At). The Trapezoidal method 
has a local truncation error of order O(At*) (see exercise 3). 
6.4.3 Global truncation error 


In this section, we will relate stability, the local truncation error, and the global truncation error. 
Therefore, we need the definitions of consistency and convergence. 


Definition 6.4.2 (Consistency) A numerical time-integration method is called consistent if 


li At) =0, 
Jim, tr41(At) 


where (n+ 1)At = T, and T is a fixed time. We call a method consistent of order p if T41 = O(At?). 


Definition 6.4.3 (Convergence) A scheme is called convergent if the global truncation error y+ sat- 
isfies 


lim e = 0 
At—0 aan J 


where (n+ 1)At = T, and T is a fixed time. A method is convergent of order p if e, 41 = O(At?). 


The theorem below is one of the most important theorems to prove convergence. 


Theorem 6.4.1 (Lax’s equivalence theorem, [6]) [fa numerical method is stable and consistent, then 
the numerical approximation converges to the solution for At — 0. Moreover, the global truncation error 
and the local truncation error are of the same order. 


Proof 
This theorem will only be proved for the test equation y’ = Ay. In that case, we know that 
Wy41 = Q(AAt)wn, such that the global truncation error, e,, +1, satisfies 


Cn+1 = Ynt1 — Wns = Yn — Q(AAE) Wn. 
By definition, wn = Yn — en, such that 


Cnt1 = Ynt1 — Q(AAL) Yn + Q(AAL)en. 
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Next, using equation (6.31), we arrive at 
Cn+1 = AtT41 + Q(AAL)en. 


Repeating this argument yields 


n 


ens = (Q(AAL))® Atta ase. (6.35) 
l=0 


Using stability (|Q(AAt)| < 1) and (n+ 1)At = T, the global truncation error can be bounded by 


n n 
e < AAt)|£At|t,44_¢| < At|t,21_,| < T- m TI. 
Jensal S Po IQS |Atta sel SV Atltnga-el $ T+ max. |e 


This implies that the order of convergence of the global truncation error equals that of the local 
truncation error. Furthermore, the global truncation error tends to zero as At — 0, since the 
numerical method is consistent. xX 


6.5 Higher-order methods 


In this section, the importance of higher-order methods will be investigated. Furthermore, an 
example of a higher-order method will be treated. 


First, we illustrate the importance of the order of accuracy. One advantage of higher-order meth- 
ods is that a higher accuracy is achieved at least for sufficiently small time steps At. Since larger 
time steps are admissible to obtain a fixed accuracy, fewer time steps are required. However, the 
total amount of work also depends on the number of function evaluations per time step. For 
example, the Forward Euler method needs one function evaluation per time step, whereas the 
Modified Euler method needs two evaluations. In the next example, the total number of function 
evaluations is computed for a fixed accuracy. 


Example 6.5.1 (Order of accuracy) 


To illustrate the importance of a higher-order method, as a thought example we assume that 
the global truncation error for the Forward Euler method equals |e, | < At, and |en| < Af? for the 
Modified Euler method. If we integrate an initial-value problem from fp = 0 to T = 1, and require 
that |en| < 10~!, then it follows that At = 107! for the Forward Euler method, and At = 1/3 
for the Modified Euler method. This means that 10 function evaluations are required using the 
Forward Euler method, whereas only 3-2 = 6 evaluations are required for the Modified Euler 
method. 


This effect is even more visible when a higher accuracy is required, as can be seen in Table 6.1. 


Table 6.1: Number of function evaluations (denoted by # fe) for the Forward Euler method and 
the Modified Euler method using various accuracies. Integration is done from tg = 0 to T = 1, 
and At is chosen such that the desired global truncation errors are found. 


Forward Euler method | Modified Euler method 
lyn — wn 


107 107 10 0.3 6 


1072 1072 100 1071 20 
1074 104 10000 10-2 200 


Example 6.5.1 shows that a higher-order method is to be preferred, if a high accuracy is required. 
This observation is valid as long as the solution is sufficiently smooth. 
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The fourth-order method of Runge-Kutta (RK4 method) 

Next, the fourth-order method of Runge-Kutta will be introduced. The use of this higher-order 
method not only results in large savings, the RK4 method also has attractive stability properties 
despite its explicit nature. 


The RK4 method approximates the solution at t,41 by 


Wnt = Wn + 2(kr + ko + 2k + ke), (6.36) 
in which the predictors k; to kg are given by 
ky = Atf (tn, Wn), (6.37a) 
ky = Atf(tn + st, Wy t sh) (6.37b) 
kg = Atf (tn + sat, Wn + sh) (6.37c) 
kg = Atf (tn + At, Wn + ks). (6.37d) 


Note that the corrector formula is based on Simpson’s rule for numerical integration: 


fn41 


Yltnsi) = lin) + fe f(t y (Eat 


Bay 4 at (Fltn-ylts) +4f (1 zs saty (1 ie 54) +f (te + Ate y(tn + at))) 


Here, y(tn) is approximated by wy, and the values of y(tn + At/2) and y(tn + At) have to be 
predicted. From equations (6.37) it follows that both wy, +k,/2 and wy + k2/2 approximate 
y(tn + At/2), and that y(tn + At) is predicted by wy + kg. 


The amplification factor of the RK4 method can be computed using the test equation. Application 
of formulae (6.36) and (6.37) to y’ = Ay results in 


1 1 1 
Wnt = (1 ae Paya 5 (AAt)? + g (Ant) ht ay tan) Wn: 


This means that the amplification factor of the RK4 method is given by 


Q(AAt) = 14+ AAE+ 5(aaey? + F(AAt)° 4 5 (anys (6.38) 
Next, the local truncation error is computed for the test equation. The quantity z,,1 is defined 
by replacing wy in (6.36) and (6.37) by yn. Substitution of f(t, y) = Ay leads to 2,41 = Q(AAt) yn. 
Using that y,41 = eg (equation (6.32)), we obtain 


Yast —Zn41 = (e*4! — Q(AAR)) yn. (6.39) 
The right-hand side of (6.38) consists exactly of the first five terms of the Taylor series of e*4'. 
Hence in the right-hand side of (6.39), only the 5th and the higher powers of At remain. Division 
by At shows that the local truncation error is O(Af*). Also for a general initial-value problem, it 
can be shown that the RK4 method is of fourth order. 


Finally, it is important to investigate the stability properties. From (6.38) it follows that for each 
At > 0, Q(AAt) is larger than one if A is positive, hence exponential error growth will be in- 
evitable. 

For A < 0, the RK4 method is stable when 


1 1 1 
1 es Beer ae se box 4 vial <1, where x = AAt. (6.40) 
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The left inequality of (6.40) yields 


1 i 1 
2+x-4 ae ae | mae au 
which is satisfied for all x. This is seen by considering the extremes of the polynomial 
1 1 1 
Plt) = 2b ae a hai 


The extremes are attained at the zeros of P’, i.e. for those values % for which 


P’(%) H148+ 52428 =0. 


6 
An extreme value of P is found by substituting % into P and using the relation P’(%) = 0: 
Le ee ee a eee Oe a 1 4 
P(X) =2+%4 Set Ee on =1+ PR) + 558 = thing 


Note that P(%) > 0 for all ¥. This means that all extremes of P are positive, hence also the 
minimum of P. Consequently, P is positive for all x and that proves the proposition. The left 
inequality can therefore not be used to derive a stability bound for At. 


Next, the right inequality in (6.40) should be considered. This inequality can be written as 
x(1+ Ppt: + 43) 200) (6.41) 


Because x = AAt < 0, equation (6.41) is equivalent to 


1 1, 123 
eee —x3 > 0. 
1+ 5x + Ex + 54% >0 


Analysis of this polynomial shows that it possesses only one zero, for x + —2.8. The function is 
negative for x < —2.8, and positive for x > —2.8. From this it follows that x = AAt > —2.8, or 


At< a, (6.42) 


in order to satisfy —1 < Q(AAt) <1. 
Example 6.5.2 (Order of accuracy) 


To extend example 6.5.1, we assume that the global truncation error for the RK4 method equals 
len| < At*. Note that the RK4 method requires four function evaluations per time step. In Table 
6.2, the results can be seen. Note that for a certain accuracy, the RK4 method can use much larger 
time steps than the Forward Euler and the Modified Euler method. For higher accuracies, the 
number of function evaluations is therefore much smaller than for lower-order methods. 


Table 6.3 presents an overview of the stability conditions and local truncation errors for all dis- 
cussed methods. 


6.6 Global truncation error and Richardson error estimates 


In this section we analyze the behavior of the global truncation error as a function of the time 
step. For the given numerical time-integration methods, the global truncation error is of order 
O(At?P), with p = 1 for the Forward Euler and Backward Euler method, p = 2 for the Trapezoidal 
and the Modified Euler method, and p = 4 for the RK4 method. We assume that for sufficiently 
small values of At, the global truncation error can be approximated by 


e(t, At) = cp(t)At?P. (6.43) 


In general, the solution to the initial-value problem is unknown. In this section, we apply 
Richardson’s extrapolation (Section 3.7) to numerically estimate the global truncation error. 
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Table 6.2: Number of function evaluations (denoted by # fe) for the Forward Euler, Modified 
Euler and RK4 method using various accuracies. Integration is done from tg = 0 to T = 1, and 
At is chosen such that the desired global truncation errors are found. 


Forward Euler Modified Euler 
Lyn — 1 


107 107 10 0.3 


10-2 107-2 100 1071 1 
10-4 10-4 10000 107-4 He 2 


Table 6.3: Amplification factors, stability conditions (A € IR, A < 0), and local truncation errors. 
[method | amplification factor [ stab. condition [tia] 
Forward Euler 1+ AAt 
Backward Euler a 
14+ 5AAt 
1-4AAt 
Modified Euler 1+ AAE+ 4(AAE)? 
RK4 1+AAt + B(AAE)? + LAA + 


Trapezoidal 


6.6.1 Error estimate, p is known 


Similar to Section 3.7.2, two numerical approximations of the solution at time t are used to com- 


ute the global truncation error estimate. Let w24t, and w4! denote the approximations for 
Pp 8 N/2 N PP 


y(t) = y(NAt) using N/2 time steps of length 2At and using N time steps of length At, re- 
spectively. 


For At small enough (see Section 6.6.2), higher-order terms are disregarded, and hence we obtain 


e(t, 2At) = y(t) — wyyy = Cp(t)(2At)?, (6.44a) 
e(t, At) = y(t) wh =cp(t)At?. (6.44b) 


By subtracting these two equations we obtain 
WR — Wry = Cp(t)AtP(2? — 1), (6.45) 


from which the value of cy(t) follows: 


wy! — NID 


By substituting this estimate of c,(t) into (6.44b), we obtain a global truncation-error estimate for 
the approximation with time step At: 
At 7,2At 
Wy — WwW 
t)— wif = Ye 6.47 
y(t) — wy Pp] (6.47) 


The order of the method manifests itself in the factor a of the global truncation-error estimate. 


6.6.2 Error estimate, p is unknown 


In the practical situation in which both the solution and the order of the global truncation error 
are not known, there is no immediate quality check of the global truncation-error estimate. For 
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a certain value At, it is unknown whether relation (6.43) is sufficiently well satisfied to obtain an 
accurate truncation-error estimate. Applying the same approach as in Section 3.7.2, and defining 


writ as the approximation for y(t) using N/4 time steps of length 4At, equation (3.14) gives that 


2At AAt 


etme =P. (6.48) 
WN — WN/2 


If for a certain At the value of equation (6.48) is in the neighborhood of the expected 2?, then At 
is small enough to obtain an accurate global truncation-error estimate. 


Example 6.6.1 (Water discharge) 


In this example, Richardson’s extrapolation is applied to a slightly modified water-discharge 
problem (see initial-value problem (6.1)), given by 


{ y! = 50 — 2y2, 
y(0) = 0. 


An explicit solution to this problem is not known. Table 6.4 gives the numerical approximation 
of the solution at f = 0.2 using the Forward Euler method for various time steps. The estimate for 
y(t) — wy! (equation (6.47)) is also given, together with the estimated value of 2?, using equation 
(6.48). Note that indeed, the global truncation-error estimate is doubled when 2At is taken. The 
results in the fourth column show that from Aft = 0.00125 onwards the global truncation error 
may be considered linear in At. 


eee (6.49) 


Table 6.4: Numerical approximation and global truncation-error estimates for the solution to 
problem (6.49) at t = 0.2, using the Forward Euler method and equations (6.47) and (6.48). 


W 


0.01000000000 
0.00500000000 
0.00250000000 
0.00125000000 
0.00062500000 


0.00031250000 
0.00015625000 
0.00007812500 
0.00003906250 
0.00001953125 


N 
4.559913710927 
4.543116291062 
4.534384275072 
4.529943322643 
4.527705063356 
4.526581601706 
4.526018801777 
4.525737136255 
4.525596237317 
4.525525771331 


-0.01679742 
-0.00873202 
-0.00444095 
-0.00223826 
-0.00112346 
-0.00056280 
-0.00028167 
-0.00014090 
-0.00007047 


1.9236589 
1.9662485 
1.9841099 
1.9922881 
1.9962008 
1.9981144 
1.9990607 
1.9995312 


In general, the global truncation error at a certain time t should satisfy |e(t, At)| < e, for a certain 
bound e. Richardson’s extrapolation can be used to determine whether a certain value of At is 
accurate enough. 


In practice, At is often based on knowledge of the general behavior of the solution or an idea 
about the number of points necessary to recognize the solution visually. For example, to visualize 
the sine function on the interval (0, 7t), about 10 to 20 points are required. To estimate the global 
truncation error for this value of At, the numerical integration is performed three times (for At, 
2At and 4At), and the value of (6.48) is computed. Note that if the estimation of 2? is not accurate 
enough, then At should apparently be taken smaller. In practice, the time step is halved, such that 
two earlier approximations can be reused. As soon as 2? is approximated accurately enough, 
an estimate for the global truncation error according to (6.47) is computed, using the last two 
approximations. If the estimated global truncation error exceeds the threshold, we again take a 
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smaller time step, until the error estimate is small enough. The approximation of the solution 
will become more and more accurate because (6.48) is satisfied better and better as At decreases. 


This characterizes the use of adaptive time-stepping in the numerical integration of initial-value 
problems. More details about adaptive time-stepping can be found in [8]. 


6.7 Numerical methods for systems of differential equations 


This section focuses on numerical time-integration methods for systems of differential equations. 
Here, we assume that 1,...,Ym are m unknown functions and that each function is characterized 
by an initial-value problem of the form 


{ Yj = fit Yur- Ym), t > to, 
Yj(to) = Yjo, 


j = 1,...,m. Writing this in vector notation by using y = (y1,..-,Ym)' and f = (ft,..-,fm)', 
yields the vector-valued initial-value problem 


For the Forward Euler method, it will be shown that the application to systems of equations is 
similar to the scalar case. Similarly, the vector-valued generalization of the scalar case will be 
presented for the Backward Euler, Trapezoidal, Modified Euler, and RK4 method. 


For systems of differential equations, the Forward Euler method (equation (6.5)) should be ap- 
plied to each differential equation separately. For the jth component, the solution at time t,+1 is 
approximated by 


Wind = Win At fi(tn, Winr--+,Wmn), (6.50) 


j=1,...,m. The corresponding vector notation reads 
Writ = Wat Atf(tn, Wi): (6.51a) 


Indeed, the Forward Euler method for systems of differential equations is a straightforward 
vector-valued generalization of the corresponding scalar method. Using the definitions of the 
Backward Euler (equation (6.6)), Trapezoidal (equation (6.8)), Modified Euler (equation (6.9)) 
and RK4 method (equation (6.36)) we obtain their vector-valued generalizations: 


Backward Euler method: Writ = Wa t+ Atf(tn41, Writ), (6.51b) 

Trapezoidal method: Writ = Wit a (£(tn,Wn) + £(tr41,Wn+1)), (6.51c) 
Modified Euler method: Writ = Wi t+ Atf(tn, Wn), 

Wrst = Wnt Se (Flue Wn) + ECA Wns), (651d) 

RK4 method: k, = Atf(tn, Wn), (6.51e) 

kp = Atf(ty + sat, wr + ski), (6.51f) 

k3 = Aff (ty + 5A, eee 5ko), (6.512) 

ky = Atf(tn + At, Wn +ks), (6.51h) 

Writ = Wi (ki + 2ky + 2k3 + ky). (6.513) 
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For the implicit methods, an approximation of the solution to a nonlinear system should be de- 
termined at each time step. This can be achieved using a method from Section 4.6. 


Higher-order initial-value problems 
A higher-order initial-value problem relates the mth derivative of a function to its lower-order 
derivatives. Using the notation 
. dix. 
x) = Ha j=0,...,m, 
we consider the following initial-value problem: 


xl) = f(t,x,x),...,x0-D), £> to, 


x(t) = x0, 


x(™-1) (tg) = x"), 


which consists of a single mth order differential equation and a set of initial conditions for x and 
all of its derivatives up to the m — 1th order. 


One way to solve such a higher-order initial-value problem is to transform the single differential 
equation into a system of first-order differential equations, by defining 


y=%, 
y2= x), 
Yn = x (m1) 


In this way, problem (6.52) transforms into the following first-order system: 


4 = Ye, 


fe oe 
Yin—-1 = Ym 


Vin = F(t Yt +. Ym), t > to, 


y1(to) = Xo, 
ym(to) = x6"). 


Example 6.7.1 (Mathematical pendulum) 


The angular displacement of a pendulum is described by the initial-value problem 


y”+sinp=0, t>0, 

(0) = to, 
y'(0)=0. 

By letting y; = p and y2 = y’, the corresponding first-order system reads 


=-—siny;,, t>0, 
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The Forward Euler method approximates the solution at t, +. by 
Wint1 = Win + Atw2n, 
Wn = Won — AtsinW,n, 


or, in vector notation: 


W2n 
Ww =w, +At cx; : 
rt if ( —sinwW1n ) 


Example 6.7.2 (Second-order initial-value problem) 
Consider the initial-value problem 
ax" + bx’ +cx= g(t), t> to, 
x(to) = xo, 
PA ese 
The second-order differential equation can be transformed to a system of first-order differential 
equations by defining y; = x, y2 = x’: 


4 = Yo, 

y= —2y2 — Lyi +ig(t), t> to, 
yi(to) = *o, 
Yo(to) = xh ) 


By defining 


me(2 E)e (8): amt st (28)): 


we arrive at the matrix-vector form y’ = Ay + g(t). 


6.8 Analytical and numerical stability for systems 


It becomes clear from Section 6.7 that the numerical methods for systems of differential equations 
have a similar structure as their scalar counterparts. In this section, we will analyze their stability 
behavior and relate it to that of the scalar case. 


As in the scalar case, we first consider a simplified initial-value problem for a linear differential 


equation with constant coefficient matrix, which constitutes a direct generalization of the scalar 
case (6.11): 
/ 
{ y(to) = yo. ee2) 
Here y and g(f) are vectors of length m and A is the m x m coefficient matrix. Assuming pertur- 
bations only in the initial values, the perturbed problem is given by 


{ y =Ayig(t), t> to, 
¥(to) = yo + €0. 


For the difference e = y — y, the test system holds: 


{ e’=Ae, t> to, 
e(to) = €o. 


If we assume that the matrix A can be diagonalized, then the stability properties of system (6.54) 
can be related to the stability properties of a scalar initial-value problem. It will be shown that 
the eigenvalues of the matrix A (see definition 1.6.5) play the same role as A in the scalar test 
equation y’ = Ay. The major difference to the scalar case is the possible occurrence of complex 
eigenvalues that has to be taken into account. 


(6.54) 
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6.8.1 Analytical stability of the test system 


The analytical stability of the test system is explained in depth in Section 6.8.2. The most impor- 
tant conclusion is that the system of differential equations is analytically stable depending on the 
eigenvalues A; of A (see definition 1.6.5). The following characterization holds [4]: 


absolute stability <== forall Aj;: Re(Aj) <0, 
stability <= forall Aj: Re(A j) <0, 
no stability <= there existsaAj;: Re(A;) > 0. 


The following example is used to show how the procedure works. 
Example 6.8.1 (Second-order initial-value problem) 


In this example we use the system of differential equations as given in Example 6.7.2. The corre- 
sponding test system is equal to y’ = Ay. The eigenvalues of A are given by the solutions to the 
equation 


det(A — AI) = 


O-—A 1 | 


b iy hoe 8 Cs 
_s ee a( ; A) +o ante Dest 


a a 
Multiplying this equation by a leads to the equivalent equation aA* + bA + c = 0, with solutions 


—b+ Vb — 4ac : 


2a 


M2 


note that the eigenvalues are complex if b? — 4ac < 0. Table 6.5 contains some possible choices 
for a,b, and c, the corresponding eigenvalues and the stability result. 


Table 6.5: Analytical stability investigation for initial-value problem of Example 6.8.1. 


pa} bie}; A |’ ___| stability result _| 
if 2];o] 0 |-2 [stable 


T[=2;o, 2 [0 [unstable _| 


6.8.2 Motivation of analytical stability * 


If A is diagonalizable, then there exist a non-singular matrix S and a matrix A such that 
A= SAS"1. (6.55) 


Here, A = diag(A1,...,Am) contains the eigenvalues of A (which can be complex)*, and S is the 
corresponding eigenvector matrix having the (right) eigenvectors vj,..., Vm of A as columns. For 
the sake of completeness we recall the definition: Av j = Aiv;, f= Nywteg Ms 


Let y = S~‘e, such that « = Sy. Substituting this transformation in (6.54) and using factorization 
(6.55) yields 
{ Sy'= ASy =SAn, 
y(to) = S~ leo = Yo. 


2We write diag(A1,...,Am) to denote a diagonal matrix with A1,...,Am on the main diagonal. 
& 8 & 
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Since S is non-singular, this yields 


y= Ay, 
{ (to) = No- seo) 


Note that the above system is fully decoupled. 
For each component of the vector 4, the solution is given by 
4 = nj oei'—), f= dpe 
Note that in general, the eigenvalues can be complex. Writing each eigenvalue as 
with y1j,v; € R it follows that 


"ij = nj get (to) vito) 


Because |e!”i(‘~'0)| = 1 (since |e| = |cos€+isiné@| = \/cos?@+sin2é = 1 for all € € R), it 
holds that |7;| = |yj,o|e" (‘~to). Therefore, 


Injol, if pj =0, 
lim |y;| = 0, ify <9, 
cee oo, if uj > 0. 


From this observation it follows that the system is stable if vj = Re(A;) < 0 for all j = 1,...,m. 
Absolute stability is obtained when pj = Re(A;) < 0 for all j = 1,...,m. If there exists a j for 
which 1; = Re(A;) > 0, then the system is unstable. 


6.8.3 Amplification matrix 


For each numerical time-integration method, the amplification matrix can be computed by ap- 
plying the time-integration method to (6.54). We obtain 


Forward Euler method: Q(AtA) = 1+ AtA, (6.57a) 
Backward Euler method: Q(AtA) = (I— AtA)"}, (6.57b) 
-1 
Trapezoidal method: Q(AtA) = (1 = sata) (1 + sata) ; (6.57c) 
2 AP 2 
Modified Euler method: Q(AtA) =1+4+ AtA >A ; (6.57d) 
AP ABs Att 
RK4 method: Q(AtA) =1+AtA as + et + aE (6.57e) 


6.8.4 Numerical stability of the test system 


As has been mentioned in Section 6.8.1 and motivated in Section 6.8.2, the stable integration of 
system (6.53) is only possible if all eigenvalues of the matrix A have non-positive real parts. These 
are stable systems with non-increasing solutions. 


In Section 6.8.5, the numerical stability of the time-integration methods for systems will be de- 
rived. The most important conclusion is that for numerical stability, the scalar amplification 
factor can be used. If 

JQ(AjAt)| <1 


1,...,m), then the numerical method is stable. Absolute stability 


for all eigenvalues A; of A (j = 
| < 1 for all eigenvalues, and instability when there is at least one 


is obtained when |Q(A;Af) 
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eigenvalue for which |Q(A;At)| > 1. This leads to the following characterization of stability for 
systems (cf. equations (6.17)): 


absolute stability <=» forall A; : |Q(A;At)| <1, 
stability <=> forall A;: |Q(Aj;At)| < 1, 
no stability <= thereexistsaA;: |Q(AjAt)| > 1. 


Note that in general, A; is complex, such that the modulus of Q(A;At) needs to be computed (see 
definition 1.6.3). 


In this section, we will investigate the stability conditions for several time-integration methods. 
Example 6.8.2 (Stability condition for the Forward Euler method) 


Absolute stability for the Forward Euler approximation of a system of differential equations is 
obtained when all eigenvalues Aj of the matrix A satisfy the strict inequality 


J 1+ AjAt |< 1, (6.58) 


j=1,...,m. Writing Aj = yj + iv;, with p,v; € IR yields (apply definition 1.6.3) 


[1+ AjAt|= | (1+ pjAt) + ivjAt |= \/(1 + pjAt)? + vFAP. 


Taking squares on both sides of the stability condition (6.58) yields the following inequality: 
(1+ pjAty +7AP <1. (6.59) 


This inequality has to be satisfied by all eigenvalues of matrix A to ensure absolutely stable nu- 
merical integration. Note that the above equation cannot be satisfied if eigenvalue A; has a posi- 
tive real part (vj; > 0). Hence, if at least one eigenvalue of the matrix A has a positive real part, 
then the system is unstable. 


The corresponding stability condition for At follows from the equation (1+ Aty ae + veAt =i, 
The nonzero root of this quadratic polynomial in At is 


Qu; 
At=-—, Mi 7 
ee 


For each eigenvalue Aj the stability condition is given by 


2 27 
py 


At < — 


j=l1,...,m. 


If A; is real-valued, then this condition can be related to the scalar case (condition (6.18)). Ensuring 
that the above condition is satisfied for all eigenvalues, the integration of the system is absolutely 


stable if At satisfies 
' 2H 
j=1,...m BM + Vj 
For an imaginary eigenvalue (1; = 0), the Forward Euler method is always unstable. 
Example 6.8.3 (Unconditional stability of the Backward Euler method) 


Stable integration using the Backward Euler method is obtained when 


: <1 


|Q(AjAt)| = [1 — Ajai S 
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for all eigenvalues Aj =pj+ iv; of A. Note that this is equivalent to |1 — Aj At| > 1, which holds 
when 
| 1—AjAt =| 1 — pjAt — ivjAt |?= (1 — pjAt)* + (vjAt)? > 1. (6.61) 


If all eigenvalues have non-positive real part ju; (which is required to obtain an analytically stable 
system), then equation (6.61) is automatically satisfied, independent of the time step At, and 
therefore, |Q(A;At)| < 1. In that case, the Backward Euler method is unconditionally stable. 


If all eigenvalues are imaginary (4; = 0), then the Backward Euler method is absolutely stable, 
since |Q(A;At)| < 1 for all eigenvalues. 


Example 6.8.4 (Unconditional stability Trapezoidal method) 

For the Trapezoidal method, the amplification factor for eigenvalue Aj =p+ iv; is 

1+ zAjAt 
1 


| (A+ 3pjAt) + givjAt | 


| Q(AjAt) | = : 
} | (1— 51 jAt) —- sivjAt | 


(1+ gp jAt)? + (Zv;At)? 


(1 — Zp jAt)? + (4v;At)? 


For eigenvalues with non-positive real part (4; < 0) the numerator is always smaller than or 
equal to the denominator, and hence, |Q(A;At)| < 1. If this holds for all eigenvalues, then the 
Trapezoidal method is stable for each time step, hence this method is unconditionally stable. 


If all eigenvalues are imaginary (y/; = 0), then the Trapezoidal Method is stable, since |Q(A;At)| = 
1 for all eigenvalues. 


6.8.5 Motivation of numerical stability * 


Theorem 6.8.1 Let A € IR"*™ be diagonalizable, and A = diag(Ay,...,Am) be the matrix with the 
eigenvalues of A, with corresponding eigenvector matrix S (having right eigenvectors v1,...,Vm of A 
as columns). Then for each numerical time-integration method, the amplification matrix, Q(AtA), is 
diagonalizable, and its eigenvalues are given by Q(A,At),..., Q(AmAt): Q(AtA) = SQ(AtA)S~1. 


Proof 
First, it should be noticed that 


A‘ — (SAS~!)‘ = SAkS~1, fork € Z. 
From this observation it follows that any polynomial 
P(x) = pot pixt...+ pnyx"?, np EN, 
evaluated in AtA equals 
ip Np Np 
P(AtA) = )~ px(AtA)* = )~ p,S(AtaA)‘S~! = § (x niaea)') S-'=SP(AtA)S1, (6.62) 
k=0 k=0 k=0 
in which P(AtA) = diag(P(A;At),..., P(AmAt)). 
Next, note that each amplification matrix in Section 6.8.3 can be written as 
Q(AtA) = R(AtA)~'P(AtA), (6.63) 


in which P(AtA) and R(AtA) are polynomials in AtA. For example, for the Trapezoidal method, 
we have 


1 1 
R(AtA) = 1— 5 AtA, and P(AtA) = 1+ SAtA. 
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Using equation (6.62) in equation (6.63) yields 
Q(AtA) = (SR(AtA)S~!)~!(SP(AtA)S~!) = SR(AtA)~!P(AtA)S~1. (6.64) 


Note that R(AtA)~! and P(AtA) are both diagonal matrices, and therefore, their product is also 
a diagonal matrix: 


R(AtA)~1P(AtA) = diag(P(A,At)/R(AqAt),..., P(AmAt)/R(AmAt)) 
= diag(Q(A,At),..., Q(AmAt)) = Q(AtA). (6.65) 


The above derivation makes use of the fact that the amplification matrix is built using the scalar 
amplification factors. 


Substituting the expression (6.65) in equation (6.64) proves that Q(AtA) = SQ(AtA)S71. xX 


In order to investigate numerical stability of time-integration methods, the test system (6.54) is 
used. For each time-integration method, the approximation at time t,41 equals 


Enii = Q(AtA)en. (6.66) 


Next, using that e = Sy (as in Section 6.8.2), equation (6.66) is transformed into (using Theorem 
6.8.1) 


Sif. = Q(AtA)Sy,, = SQ(AtA)m, 
Because S is non-singular, this means that 
Wns = Q(AtA)y,, 


which can be written componentwise as 


Mont = QArAt) Mn, 
Hon+1 = Q(A2At)H2n, 


Ymn+1 = Q(AmAt) Ym,n- 


Stability of y,, is equivalent with stability of e,, since S does not depend on the approximation. 
This means that a sufficient condition for numerical stability is 


JQ(AjAB)| <1, f=1,...,m. 


If for one of the eigenvalues |Q(A;At)| > 1, then the system is unstable. 


6.8.6 Stability regions 
In this section, the stability regions of the different methods are investigated graphically. 


First of all, the Forward Euler method will be considered. In Figure 6.4(a), the region where this 
method is stable is depicted: 


SFE = {AAt EC: 1+ AAt| < 1}. 


Note that this set represents a circle with midpoint (—1,0) and radius 1. For points AAt € Sre 
the modulus of the amplification factor is less than 1 and outside the circle it is larger than 1. 
The region inside the circle is therefore called the stability region of the Forward Euler method. The 
stability region of the Forward Euler method lies completely to the left of the imaginary axis and 
is tangent to that axis. 
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The stability region may be used to provide a graphical estimate of the stability condition for 
At. As a starting point, each eigenvalue A; should be marked in the complex plane (which cor- 
responds to assuming At = 1). The position of the marked point shows whether the integration 
is stable for Af = 1 or not. If any marked point lies outside the stability region, then At must 
be taken smaller, such that AjAt falls inside the stability region. Graphically, this corresponds 
to moving from the marked point to the origin along a straight line. In many cases even the 
unassisted eye can determine fairly accurately for which At the line intersects the circle. This 
process has to be performed for all eigenvalues; the smallest Af that results determines the sta- 
bility bound. 


Obviously, this operation will only succeed if all eigenvalues have a strictly negative real part. 
Shrinking At will never bring AAft inside the stability region when an eigenvalue has positive 
or zero real part, the latter because the stability region is tangent to the imaginary axis. Imagi- 
nary eigenvalues are often found when representing systems with oscillatory solutions without 
damping. 


In Figure 6.4, all stability regions can be seen for the different numerical time-integration meth- 
ods. Note that both implicit methods (Backward Euler and Trapezoidal method) are uncondi- 
tionally stable when all eigenvalues have non-positive real part. If A j falls in the white region of 
Figure 6.4(b) (unstable case of the Backward Euler method), then At has to be increased to reach 
the stability region. 


The stability region of the Modified Euler method is very similar to that of the Forward Euler 
method. As a consequence both methods have almost the same time step constraint. In partic- 
ular, both methods share the property that systems containing imaginary eigenvalues cannot be 
integrated in a stable way, whichever the time step is: unbounded error growth will ensue. 


The RK4 method is conditionally stable. It is important to note that this is the only discussed 
explicit method that may be stable for systems which have imaginary eigenvalues. In that case, 
RK4 is stable if |AAt| < 2.8. 


Note that the observations in this section can be related to Section 6.4.1, in which numerical 
stability was derived. 


6.8.7 Stability of general systems 


Differential equations for systems of the form 


{ y’ = f(t,y), 
y(to) = Yo, 


locally have the same properties as the linear system (6.53). The stability behavior can be deter- 
mined by linearization of the initial-value problem about (#, ¥) (note that a similar approach was 
adopted in Section 6.4.1 for general scalar equations). The role of A is taken over by the Jacobian 
matrix of f, evaluated in (f, 9): 


oft oft 
oy} “h OYm 
Jlan= |: (6.67) 
Ofm fm 
ayn "Oye TNEg) 


Note that the eigenvalues of the Jacobian matrix, A1(f,9),...,Am(#,9), depend on f and ¥. The 
stability properties of numerical time-integration methods therefore depend on time and approx- 
imation. 


90 Numerical Methods for Ordinary Differential Equations 


Im(AAt) 7 ? 


Re(AAt) 


(a) Forward Euler method (b) Backward Euler method 
Im(AAt)y 2 2 
Im(AAt) 
1 
Re(AAt) 
; 2 
ef 
29 
(c) Modified Euler method (d) Trapezoidal method 


(e) RK4 method 


Figure 6.4: Stability regions of numerical time-integration methods. 
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Example 6.8.5 (Competing species) 


In this example, we investigate the population densities of two species of bacteria competing 
with each other for the same supply of food. A model for this competition leads to the system of 
differential equations 


y= "11-1 — y2), 
Y5 = Y2(0.5 — 0.75y1 — 0.25y2), 


in which y; and y2 are the population densities of the bacteria. The solution to this system of 
equations is unknown, and therefore numerical time-integration methods are applied. 


Assume that the approximation at time ty is given by wy = (1.5, o)!. We choose f = ty, ¥=wn, 
and linearize the system about (f, 7). The Jacobian matrix of this model, evaluated at (#, 9), is 
given by 


J 


nets Peer —hy Se, 18 
(LY) —0.75i2 0.5 — 0.7591 — 0.5H2 0 —0.625 }’ 


with eigenvalues A, (7,9) = —2, Ax(é, ¥) = —0.625. 


In order to compute the approximation at time t,+41, the Forward Euler method is used. Since 
the eigenvalues are real, criterion (6.18) is applied to each eigenvalue to show that the method is 


stable in (f, 7) if 
2 
j=12 | A,(L¥) 
Note that the same procedure needs to be executed at each single time step. 


Example 6.8.6 (Stability of mathematical pendulum) 


In this example, the mathematical pendulum of example 6.7.1 is revisited. Note that the corre- 
sponding system can be written as 


yi = yw =filt,y), 
¥ =—siny, = fa(t,y), 


with y = (y1,¥2) |. The Jacobian matrix, evaluated at (?, 7), equals 


J 


nae 0 1 
(49) ~ | —cos#, 0 |" 
If —7/2 < qf, < 7/2, then the eigenvalues of this matrix are imaginary, which are given by 


Ai2(é, y) = +1,/ COS 1. 


In Figures 6.4(a) and 6.4(c), it can be seen that the Forward Euler method and the Modified Euler 
method are unstable for every At when the eigenvalues are imaginary (this has also been shown 
for the Forward Euler method in example 6.8.2). 


However, the Backward Euler method and the Trapezoidal method are unconditionally stable 
(cf. Figures 6.4(b) and 6.4(d)). We remark that the amplification factor of the Backward Euler 
method satisfies |Q(A1,2(£, ¥))| < 1, such that a damped oscillation is found when this method is 
applied (cf. example 6.8.3). However, since imaginary eigenvalues imply neither losses (dissipa- 
tion or damping) nor growth of the solution, the Trapezoidal method is more useful in this case 
({Q(Ai,2(£, ¥))| = 1, see example 6.8.4). 


92 Numerical Methods for Ordinary Differential Equations 


The RK4 method is stable if (cf. Figure 6.4(e)) 
2.8 2.8 


At < SS = 
lAr2(E¥)| — \/cos #1 


which means that a suitable time step for the RK4 method depends on ¥ and ?. If At is strictly 
smaller than the stability bound, then the RK4 method gives a damped oscillation. 


6.9 Global truncation error for systems 


The analogy of the scalar and vectorial case extends further to global truncation errors. As in 
the scalar case, y,,+1 is the solution at time t,,41, and Z,+, is the approximation at time t,.; after 
applying one step of the numerical time-integration method to y,. Then, the local truncation- 
error vector is given by 
Yu+1 — 2n4+1 

Fire eee (6.68) 
and the global truncation-error vector is given by en41 = Yn+1 — Wn+1. It turns out that the 
orders of these errors are equal to the corresponding orders in the scalar case (Section 6.9.1). For 
example, the Forward Euler method for systems has a local truncation error of order O(At), and 
if the method is stable and consistent, then the global truncation error is of the same order. 


Tn41 


For systems, Richardson’s extrapolation should be performed componentwise to obtain global 
truncation-error estimates (Section 6.6). 


6.9.1 Motivation of the global truncation error for systems * 


Test system 

First, the test system y’ = Ay will be investigated (cf. proof of Theorem 6.4.1). In that case, 
the numerical approximation at time t,,,, equals w,,1 = Q(AtA)w» (equation 6.66), such that 
Zn41 = Q(AtA)yn. The local truncation-error vector is therefore given by (cf. equation (6.31)) 


= Yun+1 — Q(AtA)yn 


Tn4+1 AL 


and the global truncation-error vector by en41 = yn41 — Q(AtA)Wn. Because Wn = yn — en, the 
global truncation-error vector equals 


€n41 = Yn+1 — Q(AtA)yn + Q(AEA Jen = Atty +1 + Q(AtA)en. (6.69) 


Next, equation (6.69) is decoupled using the transformation e;,,1 = Sy,,,,, where S is the matrix 
having the eigenvectors, v1,...,Vm, of A as columns. This means that 


Si... = AtTn41 + Q(AtA)Sy,,, 
such that (multiply by S~! and apply Theorem 6.8.1) 
na = MS! T41 +S 'Q(AtA)Sy, = AtS~!Tr41 + Q(AtA) I, (6.70) 
in which A is the matrix having the eigenvalues of A, A1,...,Am, on the diagonal. 
Next, the local truncation-error vector T,,+1 is decomposed along the eigenvectors of A: 
m 
a= Xu Mj n+1V¥j = San41- (6.71) 


The values a; 741 are the components of the local truncation-error vector with respect to the basis 
of eigenvectors. 
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Transforming S~'t,41 into S~'Sa,41 = &n41, the decoupled global truncation-error vector of 
equation (6.70) equals 


nar = Ateny1 + Q(AtA)y,. 
As in equation (6.35), this results in 


n 


Nn = Y. (Q(AtA))! Ata y4—o. 
l=0 


Because Q(AtA) = diag(Q(A,At),..., Q(AmAt)), the components of the global truncation-error 
vector are given by 


n 


£ . 
intl = x (Q(A;At)) Ato n+1-0 Gj = 1, easy m). (6.72) 
l=0 


Note that this is a decoupled system: the scalar properties of local and global truncation errors 
can be applied to each component (as in Sections 6.4.2 and 6.4.3). 


This automatically entails that each component of the decoupled local truncation-error vectors 
&1,...,0,41 has the same order of magnitude as the scalar local truncation error of the corre- 
sponding method, for example of order O(At) for the Forward Euler method. If a numerical 
method is stable and consistent, then the decoupled global truncation error is of the same or- 
der as the decoupled local truncation error (see Theorem 6.4.1). Because the matrix S does not 
depend on At, the original local and global truncation error are also of the same order. 


6.10 Stiff differential equations 


Stiff differential equations (stiff systems) describe problems that exhibit transients. Their solution 
is the sum of a rapidly decaying part, the transient, and a slowly-varying part. After a short 
period of time, the transient is invisible, and only the slowly-varying part of the solution remains: 
the quasi-stationary solution. 


In this section, we explain the concept of stiffness using the following initial-value problem with 
a linear differential equation: 


{ y' = A(y— F(t))+ F(t), £> to, 
y(to) = Yo, 


with solution y(t) = (yo — F(to) )e*'») + F(t), as may be verified directly by substitution. This 
initial-value problem is stiff if A is strongly negative and if variations of F occur on a large time 
scale (slowly varying). Then, the transient is (yo — F(to))e*“~*») (rapidly decaying) and F(t) is 
the quasi-stationary solution. 


We consider approximations using a numerical time-integration method. When we choose an 
explicit method, then stability is obtained if the time step is chosen small enough. Since A is 
strongly negative, the condition |Q(AAt)| < 1 leads to a very small bound for At. Note that this 
restriction is only due to the transient part of the solution, in which A occurs. With respect to 
the accuracy to approximate the quasi-stationary solution, larger time steps could be taken. This 
means that the stability condition restricts the time step more than the accuracy requirement. 
Therefore, implicit time-integration methods (which are unconditionally stable) are preferable 
for stiff problems. 


Considering the error behavior, it is important to note that local truncation errors are relatively 
large in the beginning, since the transient decays very rapidly. Since the quasi-stationary solution 
is slowly varying, these errors are smaller at later times. The global truncation error, that adapts 
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to these local truncation errors, also decreases after passing the transient phase. This can be 
derived from definition (6.35): 


n 


ens = Y_ (Q(AAL))’ Attra1_¢ 
l=0 


= (Q(AAB))" Att, + (Q(AAE))"! Atty +... + Q(AAL) Atty + Atty 41. (6.73) 


Hence if the time-integration method is stable and the amplification factor is small enough (for ex- 
ample, when |Q(AAt)| < 0.5), then, due to exponential decay, initial local truncation errors have 
much less influence than local truncation errors in the last few steps, which have not decayed 
that much. The smaller |Q(AAt)| is, the faster the effect of past local truncation errors decays. 
The global truncation error could even be unnecessarily small when the time step is chosen very 
small in relation to the timescale of variation of F. This is not efficient if the approximation after 
a long time is required. 


Implicit methods 

In the previous discussion, it has been explained that explicit time-integration methods are not 
very useful for stiff systems, as the stability condition requires a very small time step. Therefore, 
it is more efficient to apply an implicit time-integration method, such as the Backward Euler, or 
the Trapezoidal method. These methods are unconditionally stable, and therefore, the time step 
can be taken arbitrarily large, at least theoretically. In practice, it needs to be chosen such that 
sufficient accurate approximations of the quasi-stationary solution are obtained. Note that for 
each time step, a system of algebraic equations has to be solved. 


Both implicit methods have their unconditional stability in common but exhibit, in applications, 
significant difference in behavior as shown by the following experiment. 


Example 6.10.1 (Stiff differential equation) 


In this example, we investigate the scalar stiff problem 


fe ; 
{ i= 100(y—cost)—sint, t>0, (6.74) 


y(0)=0, 


with solution y(t) = —e7!! + cost (A = —100). The solution is approximated using the Back- 
ward Euler and the Trapezoidal method, in both cases with time step size 0.2. The size of the 
transient region is of order 0.01, which means that the first time step already exceeds this region. 
This means that the first local truncation error is large. The ease with which the global trunca- 
tion error diminishes differs, as can be seen in Figure 6.5. Comparing the approximations to the 
solution makes it clearly visible how the neglected transient influences the global truncation er- 
ror. The Backward Euler method is favorable: after four time steps the solution curve is almost 
reached. The Trapezoidal method needs more time steps to let the large initial local truncation 
error decay enough. This can be explained using the amplification factors of these methods (re- 
member equations (6.15b) and (6.15c)): 


Backward Euler method: |Q(AAt)| = | ae | row | a & 0.048, 
: . _  fltgadt} _ J 1-31000.2) 9 
Trapezoidal method: |Q(AAt)| = i-lam| = |isttooo2) = 1 * 0.82. 


Using these amplification factors in equation (6.73) shows that the initial local truncation error is 
damped out much faster when the Backward Euler method is used. 
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--- Exact --- Exact 


— Backward Euler method — Trapezoidal method 


(a) Backward Euler method (b) Trapezoidal method 


Figure 6.5: Approximation of the solution to problem (6.74) using implicit methods, At = 0.2. 


Systems 

In practice, stiffness is often encountered in systems. For systems of the form y’ = Ay + f, the 
solution is given by the sum of the homogeneous and the particular solution. The system is stiff 
if at least either of the following criteria is satisfied: 


e Some of the real parts of the eigenvalues are strongly negative, whereas some of the other 
eigenvalues have a real part close to zero; 


e The particular solution varies much more slowly than the homogeneous solution does. 


In both cases, the essential feature is that the solution contains two timescales: a fast transient 
(which determines numerical stability of explicit time-integration methods) and a slowly-varying 
component. 


Example 6.10.2 (Stiff systems) 


Stiffness in systems occurs mostly in the following form. Suppose, for example, that y’ = Ay, 
where A is a 2 X 2 matrix, having eigenvalues A; = —1 and Az = —10000. The solution is of 
the form y(t) = cyvye~! + covge~ 190 in which v1 and v2 are the eigenvectors of A and the 
integration constants c, and cz are determined by the initial conditions of the two variables. The 
term proportional to ee plays the role of the transient: this term vanishes much sooner than 
the term containing e~'. The latter is slowly-varying compared to the transient and plays the role 
of the quasi-stationary solution. However, the transient still determines the stability condition 
(Forward Euler method: At < 2/10000) over the whole domain of integration. This inhibits 
adaptation of the time step to the relatively slowly-varying quasi-stationary part containing e~'. 


Definition 6.10.1 (Superstability) A numerical method is called superstable if it is stable and 


li AAt iB 

Ahi 8 ls 

Figure 6.6 shows that the Backward Euler method is superstable, whereas for the Trapezoidal 
method, 


li AAt)| = 1. 
skp J 


This means that initial perturbations in the fast components do not decay, or decay very slowly 
when using the Trapezoidal method. 
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- - - Backward Euler method 
—— Trapezoidal method Q(AAt) 


Figure 6.6: Amplification factors of the Backward Euler and Trapezoidal method. 


This section only dealt with a simple system of differential equations, but stiffness may also 
occur in more complicated systems. Also in that case, implicit methods are to be recommended. 
However, at each time step, a nonlinear system of equations has to be solved, which increase the 
computational cost considerably. Therefore, the choice of a method is often a matter of careful 
consideration and explicit methods cannot be ruled out beforehand. 


6.11 Multi-step methods * 


All methods discussed so far in this chapter are single-step methods: the approximation at ty4 
depends solely on information of the previous point in time t,. Although various methods use 
function evaluations at intermediate points, this information is only obtained and used inside the 
interval [tn,tn+1]- 


Since approximations at times tg, t1,...,t, are available, it seems reasonable to develop methods 
that use this information to design higher-order approximations. These methods are called multi- 
step methods. In general, ¢-step methods require a single-step method (ideally of the same order) 
to create the required number of wo, w1,...,W )_1 Values to start. 


As a particular example of a two-step method, we consider the Adams-Bashforth method. 


Adams-Bashforth method 
For the derivation of the Adams-Bashforth method we start with the Trapezoidal method as 
constructed in (6.8): 


Wap = Wn + SEU Clad) +f lines ns): (6.75) 


The method can be made explicit by extrapolating to ty; +1 using f(ty-1,Wn—1) and f(tn, Wn). The 
linear interpolation polynomial based on ¢,_; and ty is given by 


Ly(t) = f(tn—-1,Wn-1) + a (f (tn, Wn) — f(tn—1,Wn—-1), 


see equation (2.1). This means that Lj (t,41) = 2f(tn,Wn) — f(tn—1,Wn—-1), and using this instead 
of f (tn41,Wn+1) in (6.75), the Adams-Bashforth method is obtained: 


3 1 
Wnt = Wn + Att (tn, Wn) ~ sAtf (tn, Wn—1)- (6.76) 
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Note that only one function evaluation is required per time step, since f(t,—1,Wy,—1) has already 
been computed during the previous time step. 


For n = 1, only one previous value is known, which is based on the initial condition. In order 
to compute a second value, a single-step method of the same order (in this case the Trapezoidal 
method, or the Modified Euler method) is applied to the initial condition. Setting wo equal to 
yo and wy to the single-step approximation for time t;, the multi-step method can be applied to 
compute w2. For n > 2, the multi-step algorithm is applied. 


Stability 
For the test equation y’ = Ay, the time-integration method yields 


Wnt+1 = (1 + saat) Wn — SAA y 1. (6.77) 


In order to compute the amplification factor, we assume that 
Wn = Q(AAL)Wy_1 and Wy4. = Q(AAt)2wy_1. 
Using this in equation (6.77) yields 


Q(AAt)2wy_1 = (1 of saat) Q(AAt)wy_1 — sade, oe 


such that the amplification factor should satisfy 
2 3 1 
Q(AAt)* — (14+ shat Q(AAE) + shat = 0. 


The roots of this quadratic polynomial are 


1+ 3Adt+ VD 


Qi(AAt) = 5 i 
sa 
in which 5 
D=1+AAt+ grat)’. 
As usual, the Adams-Bashforth method is stable if 
|Qi2(AAt)| < 1. (6.78) 


The values of Q; 9(AAt) are visualized in Figure 6.7. Note that stability bound (6.78) is satisfied 
for all values of Q;(AAt), but |Q2(AAt)| < 1 implies AAt > —1 (see lower bound in Figure 6.7). 
Therefore, the time step should be bounded by 


~le 


At <— 


in order to have a stable method. 


Local truncation error 
In order to derive the local truncation error, we use a Taylor expansion of the amplification fac- 
tors. Here, we make use of the fact that for x sufficiently close to zero, 
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Figure 6.7: Amplification factors Q) (AAt) of the Adams-Bashforth method. The stability bounds 
are also visualized. 


Applying this expansion to approximate V/D (taking x = AAt+9/ 4(AAt)?) the amplification 
factors equal 


Qi (AAt) = 1+ AAE+ s(t)? = paar)? + O(A#*), 
Q>(AAt) = saat +O CARY. 


The use of Q;(AAf) leads to a local truncation error for the test equation of (cf. equation (6.33)) 


AAE _ Q,(AAt 
eae CN, = 2 PAP Yn + O(A#) = O(AP?), (6.79) 
whereas Q>(AAt) has a local truncation error of order O(1). This is the reason why Q;(AAf) is 
called the principal root, whereas Q2(AAt) is called the spurious root. This root does not belong to 
the differential equation, but it is a consequence of the chosen numerical method. 


Comparison with the Modified Euler method 

Regarding stability the time step in the Adams-Bashforth method has to be chosen twice as small 
as in the Modified Euler method. Since the Modified Euler method costs twice the amount of 
work in each time step the methods are comparable in this respect. 


To obtain reasonable accuracy we choose the time step in such a way that the local truncation 
error, T,+1, is less than e. For the Adams-Bashforth method it follows that (using equation (6.79) 
and neglecting higher-order terms) 


12¢ 

BAS’ 

For the Modified Euler method, the local truncation error for the test equation is (using the am- 
plification factor as given by equation (6.15d)) 


At ap = = 


TH = FAP + O(A8), 
such that the same approach yields the time step 


6€ 
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This means that the time step of the Adams-Bashforth method is /2/5 ~ 0.63 times the time step 
of the Modified Euler method. 


Because the Adams-Bashforth method reuses previous approximations, two steps of the Adams- 
Bashforth method require the same amount of work as one step of the Modified Euler method. 
This means that with the Adams-Bashforth method, the approximation at time interval 2At,p 
can be computed using the same amount of work as with the Modified Euler method on interval 
Atme, whichis 2-2/5 % 1.26 times as long as the Modified Euler method’s time interval. Hence 
the Adams-Bashforth method requires less work than the Modified Euler method. 


Discussion 

Multi-step methods are less popular than Runge-Kutta methods. With multi-step methods start- 
ing problems occur, since approximations should be known at several time steps. Moreover, 
keeping track of the ‘spurious roots’ is a difficult matter. Finally, adaptive time-stepping is more 
difficult to apply. 


6.12 Summary 


In this chapter, the following subjects have been discussed: 


- Theory of initial-value problems; 

- Elementary single-step methods (Forward Euler, Backward Euler, Trapezoidal, and Modi- 
fied Euler method); 

Analysis of numerical time-integration methods: 


- Test equation, amplification factor, stability; 
- Local truncation error; 
- Consistency, convergence, global truncation error; 


Higher-order time-integration methods (RK4 method); 
Global truncation error and Richardson error estimates; 
Numerical methods for systems: 


- Test system, amplification matrix, stability; 
- Local and global truncation error; 


- Stiff systems, superstability; 
- Multi-step method (Adams-Bashforth method). 


6.13 Exercises 


1. Use the Modified Euler method to approximate the solution to the initial-value problem 


{ y=1+(t-y), 2<t<3, 
y(2)=1, 
with At = 0.5. 


The solution is given by 


1 


Determine the error in the numerical approximation. 


2. The Midpoint method is given by 


At At 
Wry = Wnt Atf (ty + Wn + ZF (tn, Wn). 


Show that the local truncation error of the Midpoint method is O(At?). 
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. Determine the amplification factor of the Trapezoidal method. Estimate with this amplifica- 


tion factor the local truncation error of the test equation. Show that the Trapezoidal method 
is stable for all At > 0,if A <0. 


. Consider the nonlinear initial-value problem: y’ = 1 + (t — y)?. Give the stability condition 


of the Modified Euler method at the point t = 2 and y = 1. 


. Consider the numerical time-integration method 


Wn41 = Wnt BAtf (tn, Wn), 
Wn+1 = Wn41 + (1— B)Atf (tn + BAL, Mn41). 


(a) Show that the local truncation error is O(At) for each value of 6. Moreover, show that 
there is no B such that the truncation error is O(At?). 


(b) Determine the amplification factor of this method. 


(c) Consider the nonlinear differential equation y’ = 2y — 4y*. Determine the maximal 
step size such that the method (with 6 = 1/2) is stable in the neighborhood of y = 1/2. 


. Perform a single step with the Forward Euler method with step size At = 0.1 for the system 


yj, = —4y1 —2y2 +e, 


5 = 3y1 + Yo, £10; 
yi(0)= 0, 
y2(0) = —1. 


. Perform a single step with the Forward Euler method for the initial-value problem 


y” —2y'+y= te’—t, t>0, 
y(0)=0, 


using step size At = 0.1. Determine the error with respect to the solution 


1 
y(t) = gfe —te' + 2e' —t-2. 


. Consider the equation of motion of the mathematical pendulum: 


oe +%p=0, t>0, 


~(0)=0, 
(0) = 0. 


Write this as a first-order system. Is this system stable? 


. Consider the system 


y} = 1195y1 — 1995y2, 
y= 1197y, —1997y2, t>0, 


The solution is given by 


yi (t) = 10e~2# — 8¢~800F, 
Yo(t) = 6e~2! — Be 8008, 
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10. 


11. 


(a) Perform a single step with the Forward and Backward Euler method with At = 0.1 
and compare the results with the exact answer. 


(b) Determine for which step sizes the Forward Euler method is stable. 
(c) Perform a single step with the Forward and Backward Euler method with At = 0.0001 
and compare the results with the exact answer. 


What is your conclusion? 


Consider the initial-value problem 


{ y=y-P4+1, t>0, 
y(0) = 5. 


Approximate y(0.1) = 0.6574145 with the Forward Euler method, using At = 0.025, and 
the RK4 method, using At = 0.1. Which method is to be preferred? 


The solution to an initial-value problem is approximated with two different methods. Both 
methods have used step sizes At = 0.1,0.05,0.025. The numerical approximations at the 
point T = 1 are tabulated below. Determine the order p of the global truncation error of 
both methods. 


Table 6.6: Solution approximations of an initial-value problem at T = 1, computed by two differ- 
ent methods. 


At method 1 method 2 
0.05 0.750686 0.730912 
0.025 0.750180 0.740587 
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Chapter 7 


The finite-difference method for 
boundary-value problems 


7.1 Introduction 


Many applications can be simulated by solving a boundary-value problem. A one-dimensional 
boundary-value problem consists of a differential equation on a line segment, where the function 
and/or its derivatives are given at both boundary points. 


Stationary heat conduction in a bar 
As an example we consider the temperature distribution in a bar with length L and cross-sectional 
area A (Figure 7.1). 


0 xx + Ax L 


Figure 7.1: Bar with length L and cross-sectional area A. 


We denote the temperature in the bar by T(x) (measured in K), and assume that the temperature 
is known at both ends: T(0) = T; and T(L) = T;. Further, heat is assumed to be generated inside 
the bar. We denote this heat production by Q(x) (J/(ms)). In general, the temperature profile 
evolves over time towards a steady state. In this example we are interested in the temperature 
after a long period of time, i.e. the steady-state temperature. For the derivation of the differential 
equation the energy conservation law is applied to the control volume between x and x + Ax (see 
Figure 7.1). 


There is heat transport by conduction through the faces in x and x + Ax. According to Fourier’s 
law this heat transport per unit area and per unit time is called the heat flow density, and equals 


where A (J/(msK)) is called the heat-conduction coefficient. For the control volume between x and 
x + Ax, the energy balance should be satisfied: the total heat outflow at x + Ax minus the total 
heat inflow at x should equal the amount of heat produced in this segment. This can be expressed 
as 


dT dT 
~AAK (x + Ax) + AAq_(x) = AQ(x)Ax. 
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Division by AAx yields 
geet Ax) = at (x) 7 
eg ar 


The corresponding boundary-value problem is obtained by letting Ax — 0: 


AFT (x)= Q(x), O<x<L, 


dt 
1(0)= f, (7.1) 
T(L)=T, 


This example will often be used in this chapter to illustrate the finite-difference method. 


In some applications the heat flux rather than the temperature is known at one of the end points 
of the bar. If this is the case at x = L, then Fourier’s law leads to the boundary condition 


in which q,, (J /(m?s)) is known. 


7.2 The finite-difference method 


The general form of a linear boundary-value problem of the second order in one dimension is 
—(Plx)y'())! + rx)y'(x) + a(a)y(e) = f(x), O<x<L, 
with boundary conditions 
agy(0) + boy'(0) =co and azy(L) + bpy'(L) = cr. 


We assume that p(x) > 0 and q(x) > 0 for all x € [0,L]. Note that the boundary-value problem 
does not have a unique solution when ag = ay = 0 and q(x) =0, 0 < x < L (if solutions exist at 
all). The following terms will be used to describe the boundary conditions (we confine ourselves 
to x = 0): 


Dirichlet boundary condition: agy(0) = co, hence bo = 0, 

Neumann boundary condition: boy’ (0) = co, hence ay = 0, 

Robin boundary condition: a9 #0 and by £0. 
In this chapter, we apply the finite-difference method to approximate the solution to a boundary- 
value problem. The key principle of this approach is to replace all derivatives in the differential 


equation by difference formulae (Chapter 3) and to neglect truncation errors. In this way, a dis- 
crete approximation w for the solution y is obtained. 


Example 7.2.1 (Boundary-value problem with homogeneous Dirichlet boundary conditions) 


In this example, we describe the finite-difference method for the boundary-value problem 
be 

y(0)=0, (7.2) 
1)=0. 


First, the interval [0,1] is divided into n + 1 equidistant subintervals with length Ax = 1/(n+1). 
The nodes are given by x; = jAx for j = 0,...,n + 1 (see Figure 7.2). The approximation of the 
solution to problem (7.2) will be represented on these nodes. 
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x0 x1 x2 ay? Xn  Xn41 
| | | | | 
0 Ax 2Ax nAx 1 


Figure 7.2: Discretization of the domain. The solution is approximated in the nodes X9,...,Xn41- 


The numerical approximation of y; = y(x;) is denoted by w;, which is computed using the dif- 
ferential equation in x;: 


Yj) +4Yj= fy, Sj<n. (7.3) 
Note that the solutions at j = 0 and j = n + 1 are known: yo = Yy41 = 0 (boundary conditions). 


The finite-difference method approximates the second derivative in equation (7.3) by a central- 
difference formula (equation (3.8)) to obtain 


2 Wj-1— 2; + Wj+1 
Ax? 


The values wo and w,+1 follow from the boundary conditions: wo = wy+41 = 0. 


+ qjwj = fir l<j<n. (7.4) 


Difference scheme (7.4) yields n equations for unknowns wW,...,Wn. This relation can be ex- 
pressed in matrix-vector form as 


Aw = f, (7.5) 
with A = K + M, where K and M are given by 
2 —-1 
—1 2 -1 @ 

i ‘ | @ 
* 0) qn 

@ —1 2 -1 

—1 2 


The vectors w and f are: w = (w,...,Wn)| and f = (fi,...,fn)!. 
Example 7.2.2 (Boundary-value problem, nonhomogeneous Dirichlet boundary conditions) 


Similar to example 7.2.1, we investigate the finite-difference method for the boundary-value 


problem 
—y!"(x) +a(x)y(x) = fix), 0<x<1, 
y(0)= a, 
y(1) = B. 


In this case, the finite-difference formula for j = 1 satisfies 


_&— 201 + W2 


Ax2 + 41 = 44; 


which is equivalent to 
—2w, + wo = a 
= ge SIL 


Similarly, the equation for j = n is given by 


B 


_— Wy-1 — 2Wn 
Ax” 


Ax2 


This leads to the following matrix-vector form: 


+ qnuWn =fnut 


Aw =f+r, 
where A, w and f are defined as in example 7.2.1, and r = 1/Ax? - (w,0,...,0, py 
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7.3 Some concepts from Linear Algebra 


In order to estimate the global accuracy of the numerical approximation, it is necessary to first 
recall several concepts from Linear Algebra. Herewith, the difference between the numerical 
approximation w; and the solution y; of the boundary-value problem will be analyzed. 


Definition 7.3.1 (Scaled Euclidean norm of a vector) The scaled Euclidean norm of a vector w € IR” 


is defined as 
1 n 
Iw = 5 


Definition 7.3.2 (Subordinate matrix norm) The natural, or subordinate, matrix norm, related to 
the vector norm ||.|| is defined as 


(AS max Awl: 


Let x = y/|ly||, where y € IR", y 4 0. Then, by definition, ||x|] = 1, and 


|All = fe || Aw|| > ||Axl| = || ATT 


| Tq! = 


is || Ay||. 


Hence, for each vector y € IR", we have 


|| Ay|] < |All - [yl (7.7) 


Condition number 
Suppose we are interested in the vector w, satisfying Aw = f. If the right-hand side f is perturbed 
with an error Af, then as a consequence the solution w will contain an error Aw. This means that 
we solve 

A(w+Aw) =f+Af. 


Using that Aw = f and AAw = Af, inequality (7.7) yields that ||f|| = ||Aw|| < |/Al] - ||w||, such 
that 
1 
||| 


A = 2 
< - and ||Aw|| = |A~*Af|| < ||A~*||- || Af], 
and the relative error equals 
||Aw| 
[|| 


< ||All Aq 1 (78) 


The quantity x(A) = ||Al| - || A~1|| is called the condition number of the matrix A. A large condition 
number implies that a small relative error in the right-hand side f may result in a large relative 
error in the approximation w. 


For a symmetric matrix A, the eigenvalues, A1,...,An, are real valued, and 


1 
A\ll=|A = A; d ||A--|| = —_—_—_ 7.9 
|All =[Amax = max il and JAM) = GE = Se 79) 
1<j<n 


Hence x(A) = |A|max/|A|min: it is sufficient to know the largest and smallest absolute eigenvalue 
(at least approximately) to compute the condition number of a symmetric matrix or an estimate 
to it. 


A useful theorem to estimate these extremal eigenvalues is the Gershgorin circle theorem: 
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Theorem 7.3.1 (Gershgorin circle theorem) The eigenvalues of a general n x n matrix A are located 
in the complex plane in the union of circles 
n 
|z—aii| < } \aiyj| where z€C. 
Ai 
j=l 


Proof 
Suppose Av = Av. By definition, component v; satisfies 


n n 

a AjjOj = Avi, such that Qjj0j + +) AjjOj = AVj. 
j=l [Fi 
j=l 


Therefore, 


n 

=, Ajj0}- 
ix 
j=l 


Next, we assume that v; is the largest component in modulus of v. Using this assumption and 
the triangle inequality (Theorem 1.6.1) it follows that 


|2)| 
|A —ay| < 3 lal ToT = : |4;j||- 
= zi = = 
This relation holds for each eigenvalue and each possible eigenvector. Therefore, the complete 
set of eigenvalues is found in the union of these circles in the complex plane. xX 


Example 7.3.1 (Gershgorin circle theorem) 


In this example, the eigenvalues of the matrix 


2.1 
4=(27) 
are estimated using the Gershgorin circle theorem. Following the theorem, the eigenvalues 


should satisfy |A — 2| < 1. Note that the eigenvalues are A; = 1, Az = 3, which indeed sat- 
isfy this condition. 


7.4 Consistency, stability and convergence 


Next, we want to show that the difference between the numerical approximation and the solution 
tends to zero if the step size Ax tends to zero. To prove this, first some important concepts will 
be defined. 


Definition 7.4.1 (Local truncation error) The local truncation error e of the scheme Aw = f is defined 
as 

e = Ay — Aw = Ay —f, 
where the components y; = y(x;) of the solution y are the exact values in the nodes xj: yj = y(xj;), and 
w is the corresponding yiuinerieal approximation. 


As an example, we investigate the local truncation error for the finite-difference discretization 
(7.4) in example 7.2.1. Using the error estimate for the central-difference discretization of the 
second derivative (Section 3.6), row j of Ay equals 


Yj t 24 — Yin 


2 + qjyj = —yj + O(Ax*) + give 
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such that 
e= Ay—f=-—y"+O(Ax’)+aqy! -£, (7.10) 


and €; = OA") 7 Syst ith 


Definition 7.4.2 (Consistency) A finite-difference scheme is called consistent if 

lim |le|| = 0. 

Ax—0 
For the boundary-value problem in example 7.2.1, ||e|| = O(Ax”), hence system (7.5) is consis- 
tent. 


Definition 7.4.3 (Stability) A finite-difference scheme is called stable if A~| exists and if there exists a 
constant C, independent of Ax, such that 


At] <C, for Ax — 0. 
Example 7.4.1 (Boundary-value problem) 


In this example, example 7.2.1 is revisited. Note that the derived matrix A in this example is 
symmetric. This means that the eigenvalues are real and that (using equation (7.9)) 


1 
LA ain’ 


We investigate two different cases (recall that q(x) > 0 in this chapter): 


A“ = 


e If0 < qmin < q(x) < qmax for all x € [0,1], then the Gershgorin circle theorem implies that 
the eigenvalues of A are located in the union of 


pe ene ea |e 
Axe | Hi} | = age Por 


and because the eigenvalues are real, the following union of intervals is found: 


4, 
qj S254 tq 5 Faden. 


This means that 


4 ’ 
qmin S Aj S Imax + 77 for P= Teng: 


From this it follows that ||A7!|| < 1/qmin, hence the scheme is stable. 
e If q(x) = 0 for all x € [0,1], then the Gershgorin circle theorem is of no use to estimate 


the smallest eigenvalue, since in that case the theorem gives no bound for the matrix norm 
|| A~!||. Without proof we note that the eigenvalues of A = K are 


_ 2—2cos jAx7 


Aj And , j=l,...,n. (7.11) 
This yields 
2—2cos Ax 2 
Amin = — Ax2 VT, 


which implies that also for q = 0 the scheme is stable. 


Definition 7.4.4 (Global truncation error) The global truncation error of a finite-difference scheme is 
defined ase = y — w. 
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Definition 7.4.5 (Convergence) A scheme is called convergent if the global truncation error satisfies 
lim _|le|| = 0. 
Ax—0 


The next theorem provides a relation between the concepts consistency, stability and conver- 
gence. 


Theorem 7.4.1 Ifa scheme is stable and consistent, then that scheme is convergent. 


Proof: 

The global truncation error is related to the local truncation error (Ae = A(y—w) = e), and 
hence e = A~‘e. Taking norms and applying inequality (7.7) yields ||e|| < ||A~‘| |lel|. From 
stability and consistency it then follows that lim,,-,0 |le|| = 0. xX 


Note that consistency in itself is insufficient for convergence. Again it is true, that the order of 
the global truncation error equals the order of the local truncation error. 


Example 7.4.2 (Convergence) 


In this example, the heat transport in a bar (Section 7.1) is described by the boundary-value 
problem 


y(0)=0, (7.12) 
y(1)=0. 
Note that this problem is related to system (7.1), where y denotes the temperature in the bar, and 


the production term Q(x) is given by 25e>* (we expect positive temperatures inside the bar). The 
solution is given by 


| Syl Db, Oe a ed, 


Via) Se? BT ee ee, 


In Figure 7.3(a) the solution is plotted, together with the numerical approximations (computed 
using the finite-difference approach of Section 7.2) for different values of Ax. The numerical 
approximation converges rapidly to the solution. In a practical situation, the accuracy obtained 
with step size Ax = 1/16 will probably be sufficient. 


Next, heat dissipation is included in the boundary-value problem. This leads to 


—y" + 9y= 258%, O<x<1, 
y(0)=0, (7.13) 


in which the term 9y describes heat dissipation to the environment, proportional to the temper- 
ature. The solution and the numerical approximation are plotted in Figure 7.3(b). Note that the 
maximal temperature is definitely lower than without dissipation (Figure 7.3(a)). For the rest, 
there is little difference between the convergence behavior of both problems. 


7.5 Conditioning of the discretization matrix * 


In this section, the boundary-value problem as stated in example 7.2.1 will be revisited to study 
the condition number of the discretization matrix. From example 7.4.1 it follows that the mini- 
mum and maximum eigenvalue when q(x) = 0, 0 < x < 1 are approximated by 


4 
(7Ax)?° 


Booted ~~ 
Amin © 7°, Amax © 


A’ and «(A) & 


If Ax tends to zero, then the condition number of A is unbounded which is not desirable, since 
this would mean that perturbations in the right-hand side f could lead to arbitrarily large errors 
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Exact 
70} + Ax = 0.25 

o- Ax = 0.125 
«- Ax = 0.0625 


Exact 
+ Ax = 0.25 

o Ax = 0.125 
«- Ax = 0.0625 


0.6 0.8 E . 
x x 


0 02 04 
(a) Without dissipation (b) With dissipation 
Figure 7.3: Exact solution and numerical approximation of the temperature distribution in a bar, 


without (problem (7.12)) or with (problem (7.13)) dissipation. 


in the approximation. It turns out that the derivation of the relative error in equation (7.8) was 
too pessimistic. A more realistic estimate is found by writing 


= 1 
|| Aw|| = | A7'Af|| < —||Af|]. 
|A| min 
Hence the relative error satisfies 
|| Aw]| 1 [fil ||Afll || Af|| 
< oe rel A. 
Pw <n wl ne eA) Tg 


The role of «(A) in this expression is taken over by the effective condition number 


mef(A) = 5 — Or 


We know that Amin *& 77 and in many applications ||f||/||w|| is bounded as Ax tends to zero, 
hence the effective condition number is, in fact, bounded. Although this estimate is more accurate 


than the original condition number, it is more difficult to compute, since (an estimate for) w is 
required. 


7.6 Neumann boundary condition 


In the previous sections, boundary-value problems with two Dirichlet boundary conditions were 
considered. In this section, we provide a method to deal with Neumann boundary conditions. 


Example 7.6.1 (Boundary-value problem with Neumann boundary condition) 


Here, the boundary-value problem of example 7.2.1 is slightly modified to include a Neumann 
boundary condition on the right boundary: 


=0, (7.14) 
0. 
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In this case, W,41 is no longer known, as it was for a Dirichlet boundary condition. Therefore, the 
discretization should also include an equation for j = n + 1, such that a system of n + 1 equations 
with n + 1 unknowns, wy1,...,W,+1 has to be solved. For j = n + 1 we have 


—Wy + 2Wy41 — Wn42 
dnt = fsa (7.15) 
ie 

Note that w,+42 does not exist, since X¥;+1 = 1 is the right boundary of the domain. In order to 
approximate w,,1 properly, a virtual point x,42 = (n+ 2)Ax = 1+ Ax is introduced (see Figure 
7.4). 


xo x1 x2 tee Xn Xn+1 Xn+2 


| 
0 Ax 2Ax nAx 1 1+Ax 


Figure 7.4: Discretization of the domain, with a virtual node x,+2. The solution is approximated 
in the nodes xo,...,Xy41- 


The value of w,,+2 follows from discretization of the Neumann boundary condition using central 
differences (equation (3.5)): 
Wnt+27 Wn _ 0 
2Ax 
This implies w,+2 = Wy. We substitute this into (7.15) and divide by two (to preserve symmetry 
of the matrix) to obtain 


—WytWy4, 1 1 
a + 5 Intent = anti (7.16) 


This means that we should solve Aw = f, where the symmetric matrix A can be written as 
A =K-4+M, with 


P23 
ae ee ® N1 2 
7 . 
K=735 and M= . F 
® a4; he 2 ae 


4n4+1 
2 


w= (w1,. . pip and f = (fi,- : pte psd 2 


Next, the error behavior for this system of equations is considered. Recall from Section 7.4 that 
the discretization of the internal domain and Dirichlet boundary yields a local truncation error of 
order O(Ax?) (equation (7.10)). For the Neumann-boundary discretization in equation (7.16), the 
local truncation error can be computed using the Taylor expansion of yy and y,+2 about Xy4+1: 


Ax? Ax 

Yn = Yn41 — AXYn 41 + n+ z= arnt + O(Ax*), 
Ax? Ax 

Yn42 = Yn41 + Axy nay + Unt + ar Yat + O(Ax*). 


Using that y/,,, = 0, this means that yn42 = Yn + O(Ax°), such that the replacement of +2 by 
Yn in the discretization of the Neumann boundary yields a local truncation error of order O(Ax). 
This means that the local truncation error can be written as 


Ay —f =e = Ax’u+t Axv, 
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where Ax7u results from the inner local truncation error (as was already found in equation (7.10)), 
and the component Axv belongs to the local truncation error due to the Neumann boundary: 
v = (0,...,0,0n41)!, with 041 = O(1). Note that the order of the local truncation error is 
different from the order of a boundary-value problem with pure Dirichlet boundary conditions. 


It turns out that the global truncation error still remains of order O(Ax?). This claim will be 
shown for q(x) = 0, 0 < x < 1, which results in a stable system. The global truncation error is 
split as follows: 


e=Ate=A7! (4x70 + Axv) = el) + (2), 


By construction, ||e) || = O(Ax?). For e), it holds that et”) = Ax80n41j,) = 1,...,n +1. This 
can be shown by verifying that Ae?) = Axv. The jth component of Ae?) equals 

(2) (2) __ (2) : ei 
ja bef Get Ax nga (-G— 1) +2j- G40) 


So — =0, 
J Ax? Ax? 


(Ae?) 


if 7 = 2,...,n, and similarly for j = 1, (Ae?)), = 0. For j =n+1, we have 
—ey) + ay _ Ax 0nyi(—n + (n +1) 


2: = 
(Ael ee = Ax2 Ax2 


= AXUn 41, 


hence indeed, Ae'2) = Axv. 


By construction, Axj < 1 (size of domain), such that a < Ax?0,,41. Therefore, ||e'?)|| = O(Ax?), 


and the global truncation error satisfies ||e|| = O(Ax?). 


For q # 0, the same orders of the local and global truncation error are found, but the details are 
beyond the scope of this book. 


7.7 The general problem * 


In this section, a general boundary-value problem on [0,1] is considered, given by 


—(p(x)y'(x))! + r(x)y!(x) + q(x) y(x) = f(x), O<x<1, 
y(0) =a, 
p(1)y'(1) = B, 


where we assume p(x) > 0 on [0,1]. 

The discretization in an interior node x; reads 
Pia (Wj41 = wj) ne Py-3 (w; a wj-1) Wi44 = Wj_-1 ; 
{he oa j=l,...,n4+1. (7.17) 


Note that if rj # 0, then the discretization matrix is not symmetric. 


The boundary condition in x = 0 yields wo = « which may be substituted directly into (7.17) for 
oe 


For j = n +1, the discretization is 


Pays (Wn42 — Wn41) + Prt (Wn+1 — Wn) Wy+2 — WwW 


Ay +Tn41 “+ 9n+1Wn41 = fnti- (7.18) 


2Ax 
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As in Section 7.6, a virtual grid point, x42, is added to the domain. In order to remove the term 
—Pn43/2(Wn+2 — Wn+1) in equation (7.18), we discretize boundary condition p(1)y/(1) = B by 
averaging the approximations for 


; Ax 7 Ax 
Pusiy (-=) and Pnt3y +> = 


Wy+1— Wn Wn42— Wn41 a 
5 (Pass Ax | Pn Ax ) B, 


This means that 


ray 


such that 
P43 (Wnt2— Wnt) = —Pyyd (nti — Wn) + 2AxB. 


Replacing this term in equation (7.18) and using that (wy+2 — Wn) /(2Ax) = B/p(1) yields 


2P nd (Wn41 aa Wn) 1n41B 


2 
Ax2 p(1) + G@n41Wns1 = faz + i 


Ax? 


7.8 Convection-diffusion equation 


In the previous section the discretization for a general boundary-value problem was derived 
by applying a central-difference scheme. However, if a second-order boundary-value problem 
also contains a first-order derivative, this may lead to (physically) unacceptable approximations. 
This will be illustrated both analytically and numerically for a specific type of boundary-value 
problems for the convection-diffusion equations. 


Example 7.8.1 (Convection-diffusion equation) 


An analytical investigation of boundary-value problems for convection-diffusion equations will 
be established using 


y(0)=0, (7.19) 
y(=1, 
with v € R. This equation can be interpreted as a heat-transfer problem. Heat transport is caused 
by conduction (diffusion, the term —y’"), and by a flowing medium (convection, the term vy’). 


—y"(x) + oy'(x)=0, O<x<1, 


It can be verified that the solution to this boundary-value problem is given by 


e’* — J] 
y(x) = w-1 


The solution increases strictly over the region [0, 1] and therefore numerical approximations that 
contain any oscillations will be rejected. However, a careless discretization of boundary-value 
problem (7.19) can result in a numerical approximation that contains (spurious) oscillations. 


Central differences 
Using the same subintervals and nodes as in example 7.2.1, the central-difference discretization 
at interior nodes x j is given by 


Wi-1 — 20; + W; Wj41 — Wi 
jo! J j+1 j+1 j-l1 : 
— = 0, GH 1... U1. 7.20 
Ax? +O Dax é ‘ ee 
Together with the boundary conditions, this leads to a system of equations Aw = f, in which 
A 
2 al" 
So eee 2 
ne < 
A A 
) ae A 2 =a 
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w = (w1,...,Wn)', and f =1/Ax?- (0,...,0,1—vAx/2)". 

Equation (7.20) is a linear difference equation, and it is known that its solution is of the form 
w; = r!, for some unknown quantity r. Substitution of w; = 1! into equation (7.20) yields after 
multiplication by Ax?: 


pr doyle td COR (rt — 1) =p z (7? — 2r +1) 4 KP -1)) =) 


which is satisfied if 
vAx 


r=0 or Gal Sst Dir 1): (7.21) 
The solution r = 0 is the trivial solution. The roots of interest are given by 
1+ 
n=1 m= aeoane? vAx # 2. (7.22) 
2 


The solution to difference equation (7.20) can be written as w; = ar! + br',. The constants a and b 
follow from the boundary conditions wo = 0 and w,+4 = 1, such that we arrive at 


w; = ——2.. (7.23) 


Note that w; becomes oscillatory if rz < 0, since r, in that case is positive if j is even, and neg- 
ative if j is odd. This happens if |v|Ax > 2. Hence, in order to obtain monotone solutions, we 
require |v|Ax < 2. In practice, the velocity is a given parameter, such that stability may require 
impractically small subintervals of size Ax < 2/|v]. 


Upwind differences 
A common remedy is to use upwind discretizations. In this discretization, oy; is approximated 
using one-sided differences: 


pe if v>0O (upwind difference), 
vy; fe Si 
“i "i if v <0 (downwind difference). 


For v > 0, the corresponding discretization equals 


W;_1 —2w; +; W; — Wj 
j-1 j j+1 ] j-1 
— St op ts = 0, 7.24 
Ax? Ax (eee) 
Application of the same trial solution w= ri, division by pil, multiplication by Ax? and some 
rearrangement results in the non-trivial roots 


Ax = 1) SS O=1)7, 


with solutions r; = 1, r2 = 1+ vAx > 0. These values reassure that the numerical approximation 
(7.23) is non-oscillatory, and therefore admissible for all possible choices of Ax > 0. 


A similar conclusion can be drawn for v < 0. 


An example of the results using either central or upwind differences is depicted in Figure 7.5. 
Here, we take v = 30 and Ax = 0.1, such that |v|Ax > 2. Note that indeed, the central-difference 
approximation is oscillatory, whereas the upwind-difference approximation is monotone. 


It should be noticed that despite the desired physically correct behavior of the numerical approxi- 
mation, the global truncation error of the upwind-difference discretization is of order O(Ax) only. 
Thus, the price to pay for non-oscillatory physically meaningful approximations with reasonable 
values for Ax is a lower order of accuracy. Without further details, we state that in practice more 
advanced methods are used to solve convection-diffusion equations. 
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Figure 7.5: Approximation of the solution to convection-diffusion problem (7.19) for v = 30, 
Ax = 0.1. 


Example 7.8.2 (Convection-diffusion equation) 
In this example, the numerical results for the boundary-value problem 
—y"+v0y=1, 0<x<1, 
y(0)=0, (7.25) 
y(1)=0. 
will be computed, using both a central and an upwind discretization. The solution to this convec- 
tion-diffusion equation is given by 
1 Le" 
yo) =F (2-5). 
Although the solution to this convection-diffusion equation is not monotone as was the case in 
example 7.8.1, a similar behavior is found. 
Approximations of the solution are determined with stepsize Ax = 0.1 for various values of the 


velocity v (see Figure 7.6). 


For v = 10, we have vAx < 2. Therefore, the central-difference approximation does not introduce 
any spurious oscillations. Since the global truncation error is of order O(Ax?), this results into a 
more accurate approximation than upwind discretization (which is of order O(Ax)). If v = 20, 
then vAx = 2, such that the discretization at node xy is given by 


Wy—1 — 2Wn + Wns Wn41 7 Wn-1 2 2 
ea cd A Bd SS = —_ =1 
Ax2 Lamy. Agel) age 


Since the term with w,.1 has been cancelled, the boundary condition in x = 1 does not have any 
effect on the rest of the central-difference approximation. However, for this example the result 
is still quite accurate. For v = 100 the central-discretization scheme produces oscillations in the 
approximation, and the errors are large. A more accurate approximation for v = 100 is found 
using the upwind-discretization scheme. 


7.9 Nonlinear boundary-value problems 


Consider a nonlinear boundary-value problem of the following form: 


—(py')'+ay,y,x)=0, 0<x<l, 
y(0)=0, (7.26) 


y(1)=0. 
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0.07 : 0.05 
Exact * Exact 
0.06) * Central | 0.045) + Central : 
°- Upwind : é * o.o4}} ° ~Upwind 
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$ 
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0.016 1 + 
Exact 
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o Upwind 
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(c) v = 100 


Figure 7.6: Approximation of the solution to convection-diffusion problem (7.25) for different 
values of v, Ax = 0.1. 


The general approach to solve (7.26) uses an iterative process of Chapter 4 in which only linear 

equations have to be solved. The derived linear boundary-value problems are solved using the 

finite-difference schemes as explained in earlier sections. The iterative method generates a series 
1 


of iterates y), y,... in such a way that y(") > y, m — co where y is the solution to (7.26). 
7.9.1 Picard iteration 
The iterative method of Picard approximates the solution to boundary-value problem (7.26) by 
determining y™) from 


a yy = g(r ged): x) 
yw"r'(0 


Here, the finite-difference approach is used to approximate the derivative of the m — 1th and the 
mth iterate. 
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7.9.2 Newton-Raphson method 


The Newton-Raphson method is based on linearization of g about (y("—), y("—-)): 


aly yx) & gy DY yD, x) + (yl = yD) $ (y= yD) g 


in which 
_ OF 6 m—1)0 (m1) — 98 (m1) (n—1) 
C= ayt ry", x) and q= ay ry", X). 


The corresponding iterative method determines y") from 


| — (py)! et ry! 4 eo = yy Ae y@Vg = (yal ged), x): 
y(0) = 0, 


7.10 Summary 


In this chapter, the following subjects have been discussed: 


- Boundary-value problems; 

- Dirichlet, Neumann, Robin boundary condition; 

- Finite-difference schemes; 

Norms, condition numbers, Gershgorin circle theorem; 
- Local truncation error, consistency; 

- Stability; 

Global truncation error, convergence; 
Implementation of Neumann boundary condition; 
Convection-diffusion problem; 

- Upwind discretization; 

- Nonlinear boundary-value problem. 


7.11 Exercises 
1. Consider matrices A; and Ap given by 
2-1 100 99 
a= (j 5) and Az = ( 99 ne 
Determine the condition number of both matrices. Determine the solution to Agw = f in 


which f = (199,199)'. Take Af = (1,0)'. Estimate ||Aw|| using the condition number. 
Determine Aw and compare ||Aw|| with your estimate. 


2. Consider the boundary-value problem 


| y"(x)=2(y(x)—x), O<x<1 


(a) By discretization we obtain a system of linear equations Aw = f. Determine A and f 
for a general value of Ax. 


(b) Estimate the largest and smallest eigenvalue of A. 
(c) Estimate ||Aw]| /||w|| if || Af|| /||£|| < 1074. 
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3. Consider the boundary-value problem 


Determine the solution y, and the numerical approximation w, with n = 2. Compute 
Yj = Wi, J = 1,2. 


4. Consider the boundary-value problem 


—y"+y=0, 0<x<1, 
1 
0 


Discretize this problem using the nodes x9 = 0, x1 = 2/7, x2 = 4/7, x3 = 6/7 and 
x4 = 8/7. Determine the discretization matrix A. Is A symmetric or nonsymmetric? Are 
the eigenvalues positive? 


Chapter 8 


The instationary heat equation * 


8.1 Introduction 


The time-dependent temperature distribution in a bar can be described by a parabolic partial 
differential equation. To solve that kind of initial- and boundary-value problems methods from 
Chapters 6 and 7 will be combined. 


As an example, we consider the temperature distribution in a bar (Figure 8.1). 


0 yx + Ax L 


Figure 8.1: Bar with length L and cross-sectional area A. 


We denote the temperature by T(x, t) (measured in K). We assume that the temperature is known 
at both ends T(0,t) = T; and T(L,t) = T,, and that the initial temperature at t = 0 is given by 
T(x,t) = To(x). The heat production in the bar is denoted by Q(x,t) (J/(m°s)). To derive the 
heat equation we use the energy conservation law applied to the control volume between x and 
x + Ax and between the points in time t and t + At. According to Fourier’s law the heat flow 
density is given by 

oT 

q(x,t) = Ax (x,t), 

where A(J/(msK)) is the heat-conduction coefficient. 


The energy balance on the control volume is given by 


oT oT 
pcT(x,t+ At)AAx = pcT (x,t) AAx — AZ (x, t) AAt + Ax (x + Ax, t)AAt + Q(x, t)AAxAt, 


in which p(kg/m*) is the mass density, and c(J/(kgK)) is the specific heat. After dividing by 
AAxAt and rearranging the terms we obtain 


potest At) — T(x,t) Peracs + Ax,t) — ££ (x,t) 


At = Ax +Q. 
Taking the limits Ax — 0 and At — 0 yields the following initial-boundary-value problem: 
pc = Akt +Q, O<x<L, 0<t, 
T(0,f)= T(t), T(L,t) = T(t), 0<t, 
T(x,0) = To(x), O<x<L. 
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8.2 Semi-discretization 


In this chapter, the space and time discretizations will be exhibited for a simple heat problem. 
There is a clear difference between an explicit (Forward Euler) and an implicit (Backward Euler) 
method of time integration. 


Example 8.2.1 (Heat-conduction problem) 


We consider the following heat-conduction problem: 


oy — Hy 0<x<1, 0<t<T, 
yOH=y.h; yO), OAS 7, (8.1) 
y(x,0) = yo(x), 0<x<1. 


We discretize (8.1) in the x-direction by means of the finite-difference method (Chapter 7). The 
interval [0,1] is divided into n + 1 equal parts with length Ax, where the nodes are given by 
x; = iAx,i=0,...,n +1. We denote the numerical approximation of y(x;,t) by u;(t). The vector 
u(t) is given by u(t) = (u,(t),...,Un(t))'. The values at the boundary points are omitted, 
because they are known from the boundary conditions. Using the techniques from Chapter 7 we 
obtain the following system of first-order ordinary differential equations 
{ ™—Kut+r, 0<t<T, (8.2) 
u(0) = yo. 


The matrix K and vector r are given by 


—2 1 
1 -2 1 2) 
1 as 
reer, . and = 52 (y(t), 0,---,0,yr(t)) 
7) 1 -2 1 


1 -2 


This is called semi-discretization (or method of lines), because discretization has been applied to 
the spatial x-direction but not to the temporal f-direction. Note that the eigenvalues A; of K 
satisfy (cf. the Gershgorin circle theorem) A; < 0 for all j = 1,...,n—1. This implies stability 
of the semi-discrete system. In the boundary-value problems of Chapter 7, —y’’ was considered. 
Therefore, the matrix K differs in sign from the matrix in equation (7.6). 


8.3 Time integration 


System (8.2) may be integrated in time by one of the methods from Chapter 6. The solution 
is approximated at times t; = jAt, using m time steps of length At = T/m. The numerical 


approximation of the temperature at position iAx and time jAf is denoted by wl (which ap- 
proximates u;(t;), and hence y(x;,t;)). Due to initial and boundary conditions, it is known that 


w? = yo(xi), W = yr(tj) and w;,.1 = yR(t))- 


8.3.1 Forward Euler method 


First, the Forward Euler method will be considered. From Chapter 6 (equation (6.51a)) it follows 
that the approximation of the solution to problem (8.2) at t;,; equals 


w!*! — wi + At(Kw! +1’). (8.3) 
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Note that wit depends on the approximations in (x;_1,¢;), (xj, t;) and (%;+1,#;) (as follows from 
the definition of K). This idea is depicted in Figure 8.2(a). 


The local truncation error of this method can be computed using definition (6.68) in Chapter 6: 


yitl — zit 


qitl ; 
At 


where z/*! is computed by applying the Forward Euler method to the solution at time tj (given 
by y/), hence z/+1 = y/ + At(Ky/ +1/). This yields (using the Dirichlet boundary conditions) 


i+] _ (yi + At(Ky/ + r/)) j+1_ vi ee dy .  Qeyl 
pa NE SE a Ry gh SS ig eg 
i At At CT Epp te gh eee 
in which 
ba Ab GF Aged? 
ej = = 53 (iG) Gj; € (tj tj), and p= Gi tj), Ci © (Xi-1, Xi41) 


follow from the truncation errors of the numerical differentiation, equations (3.2) and (3.9). 


Using the heat-conduction problem, we find that t/+! = ¢/ + p/, such that method (8.3) is of 
order O(At + Ax’). It makes sense to choose At and Ax in such a way that both components in 
the truncation error have about the same size. This already suggests that the time step should be 
divided by four if the spatial step size is halved. 

Stability 

Since the Forward Euler method is only conditionally stable, it is important to determine the 


size of the time step in such a way that no instabilities will occur. The matrix K from (8.2) is 
symmetric, hence the eigenvalues are real. From the Gershgorin circle theorem it follows that 


4 
gn eS for 1<i<n. 


The matrix has no positive eigenvalues, hence the system of differential equations (8.2) is an- 
alytically stable (Section 6.8.2). An explicit expression for the eigenvalues of —K was given in 
equation (7.11). For real eigenvalues, the Forward Euler method is stable if the time step Aft 
satisfies (equation (6.18)) 
2 
AIS —y, Vi=1,...,n. 

Using that |A|max = 4/ Ax2, we obtain At < Ax2/2. Note that if the spatial step size is halved, 
then the time step has to be taken four times as small to satisfy the stability condition. 


Note that the condition number of K is given by (Section 7.3) 


maxi=t..n Ail 4 


K K => : ~ a 
a) minj=t,.n [Ai] (Ax)? 


For small values of Ax the initial-value problem (8.2) is a stiff system, since there is a large differ- 
ence between |A|max and |A|min- Because of the stability restrictions for explicit methods, it might 
be more convenient to use an implicit method in that case (Section 6.10). 


8.3.2 Backward Euler method 
Applying the Backward Euler method (equation (6.51b)) to (8.2) we obtain the formula 


w/t! — wi + At(Kwit! 4 r/*4), (8.4) 
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such that w/t! needs to be solved from the expression 
(I — AtK) w/t! = wi + Ate/*, (8.5) 


The dependency between approximations in this case is depicted in Figure 8.2(b). Again the local 
truncation error is of order O(At + Ax). Note that the Backward Euler method is uncondition- 
ally stable, hence the time step may be chosen in accordance with the accuracy required (stiffness 
does not impair numerical stability). 


Note that the solution to (8.5) is computationally more costly than determining w/*? using the 
Forward Euler method. However, since the matrix K contains many zeros, there are efficient 
methods to solve problem (8.5). Careful consideration is necessary: cheap approximation with 
many time steps (Forward Euler method) versus expensive approximation with few time steps 
(Backward Euler method). 


8.3.3 Trapezoidal method 


It is also possible to apply the Trapezoidal method for time integration. In the literature the 
Trapezoidal method applied to the heat equation is often called the Crank-Nicolson method. The 
corresponding formula that needs to be solved is 


At a ; 
witt = wi + =e (Kw +7 +Kwi*! +141) P 


cf. equation (6.51c). The corresponding dependency between the approximations is depicted in 
Figure 8.2(c). 


tT tT t 
tj41 x * ti4d x xX xX * ti+1% x xX xX * 
tj Ko x tj * x tj + x x 
* * * * 
0. x1%) Xi a % O Maik Xie aia % 0. Xi 1%; Xin aa x 
(a) Forward Euler method (b) Backward Euler method (c) Trapezoidal method 


Figure 8.2: Dependence of approximations using central-difference semi-discretization and time 
integration. x: locations of approximations used in time-integration scheme. o: known by initial 
condition, *: known by boundary condition. 


8.4 Summary 


In this chapter, the following subjects have been discussed: 


- Instationary heat equation; 

- Semi-discretization; 

Time integration: Forward Euler, Backward Euler, and Crank-Nicolson method; 
Local truncation error; 


Stability. 
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